Computational Nucleic Acid Coding and Feature Analysis 



Field of the Invention 



The present invention is in the field of bioinformatics, particularly as it pertains to gene 
prediction. More specifically, the invention relates to the probabilistic analysis of nucleic acid 
sequences for the determination of coding features, including determination of state probabilities 
for each nucleotide in a nucleic acid sequence, determination of coding strand, determination of 
open reading frame extent, determination of insertion and deletion location, determination of 
exon location, and determination of protein sequence. 



Advances in techniques for sequencing long stretches of genomic deoxyribonucleic acid 
(DNA) have allowed investigators to collect vast nucleic acid sequence data rapidly. These 
advances, combined with initiatives to sequence the entire human genome and the genomes of 
several other species, have created a need for the rapid identification of genes on long stretches 
of sequenced DNA. Conventional gene location techniques, such as cDNA hybridization, are 
effective at locating transcribed genes, but are time-consuming and costly. 

An alternative for locating genes on DNA that has not otherwise been analyzed for 
potential coding regions involves using statistical detection methods. Such methods 
conventionally include using probability models to predict where in a DNA sequence a gene is 
located. The theoretical nucleic acid sequence probabilities can be determined through analysis 
of known coding regions in the organism of interest. Once theoretical nucleic acid sequence 
probabilities are determined, nucleic acid sequences in unannotated regions of DNA in the same 
or a similar organism can be statistically compared to the theoretical nucleic acid sequence 
probabilities. If the similarity is sufficient, the investigator is notified that a coding sequence 
exists. Conventional cloning techniques can then be used to isolate the putative gene and check 
for transcription. 

One type of statistical detection method searches DNA by content. In such content-based 
models, highly conserved regions of DNA that are common to all genes are located. If a 
conserved region of DNA is found, then the nucleic acid sequence associated with the conserved 
region can be compared with known genes. Such comparisons, which can be done with nucleic 
acid sequence comparison programs such as BLAST, are inefficient to run, however, and 
content-based searches therefore have limited desirability. 

A second type of statistical detection method searches DNA by signal. This type of 
searching involves using probability models to predict whether DNA fragments within a larger 
nucleic acid sequence are coding. Early searching by signal programs, such as TestCode and 
Grail, relied on statistical variations within coding regions of DNA, including codon frequency, 
local nucleic acid sequence composition, codon preference measures, heuristics based on 
oligonucleotide frequency variations, and measures of nucleic acid sequence complexity. 

Beyond simple gene detection, there is also a need for the determination of other coding 
features, such as the location of intron/exon boundaries in eukaryotic organisms and the location 
of insertions or deletions. The program GENSCAN (Burge, C. and Karlin, S. (1997) Prediction 
of Complete Gene Structures in Human Genomic DNA. J. Mol. Biol. 268, 78-94), for example, 
predicts exon location with local state probabilities based on oligonucleotide usage. GENSCAN, 
however, also depends on non-local nucleic acid sequence characteristics, which make the 
program very sensitive to sequencing errors and genes containing alternative splicing strategies. 

One statistical model that avoids the problems caused by dependence on non-local 
nucleic acid sequence characteristics is the inhomogeneous Markov model. An inhomogeneous 
Markov model depends upon local probabilities, and is not therefore sensitive to sequencing 
errors or genes with alternative splicing strategies. The inhomogeneous Markov model is 
"inhomogeneous" because it determines the state probabilities for a given nucleotide in multiple 
reading frames rather than in a single reading frame. GeneMark, for example, is a computer 
program that uses the inhomogeneous Markov model to locate genes. 
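
By way of illustration only, the reading-frame dependence described above can be sketched as a three-periodic Markov chain in which the transition table used for a nucleotide depends on its codon position. The sketch is an assumption made for exposition and is not the GeneMark implementation; the names TRANSITIONS, INITIAL, and frame_log_prob, and all numeric values, are hypothetical toy parameters.

```python
import math

BASES = "ACGT"
UNIFORM = {b: 0.25 for b in BASES}
GC_BIASED = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}

# Hypothetical toy parameters: one first-order transition table per codon
# position (phase). Only the third codon position (phase 2) is biased.
TRANSITIONS = [
    {prev: dict(UNIFORM) for prev in BASES},    # phase 0
    {prev: dict(UNIFORM) for prev in BASES},    # phase 1
    {prev: dict(GC_BIASED) for prev in BASES},  # phase 2
]
INITIAL = [dict(UNIFORM) for _ in range(3)]


def frame_log_prob(seq, frame):
    """Log-probability of seq when its first nucleotide is at codon position `frame`."""
    logp = math.log(INITIAL[frame][seq[0]])
    for i in range(1, len(seq)):
        phase = (frame + i) % 3  # codon position of nucleotide i
        logp += math.log(TRANSITIONS[phase][seq[i - 1]][seq[i]])
    return logp


# G or C occurs at every third position, so one frame should stand out.
seq = "ATGATCATGATC"
scores = {frame: frame_log_prob(seq, frame) for frame in range(3)}
best_frame = max(scores, key=scores.get)
```

Because every probability in the sketch is conditioned only on the preceding nucleotide and the codon position, a local error affects only the terms adjacent to it, which is consistent with the locality property described above.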

The GeneMark gene prediction algorithm was developed in several steps. A series of 
three publications demonstrated that inhomogeneous Markov models were useful tools for gene 
prediction (see Borodovsky, M., Sprizhitsky Yu., Golovanov E. and Alexandrov A. (1986) 
Statistical Patterns in Primary Structures of Functional Regions in the E. Coli Genome: I. 
Oligonucleotide Frequencies Analysis, Molecular Biology, 20, 826-833; Borodovsky, M., 
Sprizhitsky Yu., Golovanov E. and Alexandrov A. (1986) Statistical Patterns in Primary 
Structures of Functional Regions in the E. Coli Genome: II. Non-homogeneous Markov Models, 
Molecular Biology, 20, 833-840; and Borodovsky, M., Sprizhitsky Yu., Golovanov E. and 
Alexandrov A. (1986) Statistical Patterns in Primary Structures of Functional Regions in the E. 
Coli Genome: III. Computer Recognition of Coding Regions, Molecular Biology, 20, 1145-1150, 
all of which are herein incorporated by reference in their entirety). The GeneMark method was 
based on an inhomogeneous Markov model and was described in 1993 (see Borodovsky, M. 
and McIninch J. (1993) GeneMark, Parallel Gene Recognition for both DNA Strands, Computers 
& Chemistry, 17, 123-133, and Borodovsky, M. and McIninch J. (1993) BioSystems v30, pp. 
161-171, both of which are herein incorporated by reference in their entirety). The capabilities of 
the GeneMark program were subsequently investigated (see James D. McIninch, Prediction of 
Protein Coding Regions in Unannotated DNA Sequences Using an Inhomogeneous Markov 
Model of Genetic Information Encoding (1997) (Ph.D. dissertation, Georgia Institute of 
Technology, on file with the Georgia Institute of Technology Library), which is herein 
incorporated by reference in its entirety). 

Conventional programs using inhomogeneous Markov models, however, are limited to a 
defined probabilistic model for determining probability, and cannot be tailored by the 
investigator to better suit the nucleic acid sequence under study if information about that nucleic 
acid sequence is already available. Further, conventional implementations do not allow for the 
efficient and accurate detection of other nucleic acid sequence features. 

What is needed in the art is a method of determining state probabilities for a nucleic acid 
sequence having some known characteristics, where the method is insensitive to frameshift 
insertions or deletions, and compatible methods for detecting other nucleic acid sequence 
features in known or unknown nucleic acid sequences. 

Summary of the Invention 

The present invention relates to the probabilistic analysis of nucleic acid sequences for 
the determination of coding features, including determination of state probabilities for each 
nucleotide in a nucleic acid sequence, determination of coding strand, determination of open 
reading frame extent, determination of insertion and deletion location, determination of exon 
location, and determination of protein sequence. Described herein are methods, devices, and 
systems for analyzing the information content in nucleic acids. 

The present invention includes and provides a method for determining a probability for 
one or more states for a nucleotide in a nucleic acid sequence, comprising: a) determining an 
initial oligonucleotide probability for each of the states for an initial oligonucleotide in the 
nucleic acid sequence; b) determining transition probabilities for each of the states for 
nucleotides within the nucleic acid sequence following the initial oligonucleotide; c) determining 
a probability for the nucleic acid sequence for each of the states; and, d) determining a 
probability for each of the states for the nucleotide based upon the probability of the nucleic acid 
sequence and a bias. 
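
By way of illustration only, steps a) through d) above can be sketched in Python as follows. The sketch assumes a first-order Markov model per state and treats the bias as a prior weight on each state that is combined with the sequence probability by Bayes' rule; that reading of the bias, the names sequence_log_prob and state_posteriors, the two state labels, and all numeric values are hypothetical and are not taken from the specification.

```python
import math

BASES = "ACGT"
UNIFORM = {b: 0.25 for b in BASES}
GC_RICH = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}

# Hypothetical toy models: an initial oligonucleotide table and a
# transition table for each state.
MODELS = {
    "coding": {"initial": dict(UNIFORM),
               "transition": {b: dict(GC_RICH) for b in BASES}},
    "noncoding": {"initial": dict(UNIFORM),
                  "transition": {b: dict(UNIFORM) for b in BASES}},
}


def sequence_log_prob(seq, initial, transition, order=1):
    """Steps a) to c): initial oligonucleotide term plus one transition term per later nucleotide."""
    logp = math.log(initial[seq[:order]])
    for i in range(order, len(seq)):
        logp += math.log(transition[seq[i - order:i]][seq[i]])
    return logp


def state_posteriors(seq, models, bias, order=1):
    """Step d): weight each state's sequence probability by its bias and normalize."""
    log_joint = {
        state: sequence_log_prob(seq, m["initial"], m["transition"], order)
        + math.log(bias[state])
        for state, m in models.items()
    }
    peak = max(log_joint.values())  # subtract the peak to avoid underflow
    weights = {s: math.exp(v - peak) for s, v in log_joint.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


even = state_posteriors("GCGCGC", MODELS, {"coding": 0.5, "noncoding": 0.5})
skewed = state_posteriors("GCGCGC", MODELS, {"coding": 0.01, "noncoding": 0.99})
```

In this toy example the same sequence is assigned to the coding state under an even bias and to the noncoding state under a strongly skewed bias, which shows how prior knowledge about a sequence could be used to tailor the result.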

The present invention includes and provides a method for determining a probability for 
one or more states for a nucleotide in a nucleic acid sequence, comprising: a) determining an 
initial oligonucleotide probability for each of the states for an initial oligonucleotide in the 
nucleic acid sequence; b) determining transition probabilities for each of the states for 
nucleotides within the nucleic acid sequence following the initial oligonucleotide; c) determining 
a probability for the nucleic acid sequence for each of the states; and, d) determining a 
probability for each of the states for the nucleotide based upon the probability of the nucleic acid 
sequence, wherein the determining a probability for each of the states is capable of accepting a 

The present invention includes and provides a method for determining a probability for 
each of one or more states for more than one nucleotide in a nucleic acid sequence comprising: a) 
determining an initial oligonucleotide probability for each of the states for an initial 
oligonucleotide in a window of a first nucleotide; b) determining transition probabilities for each 
of the states for nucleotides within the window following the initial oligonucleotide; c) 
determining a probability for the window for each of the states; d) determining a probability for 
each of the states for the nucleotide based upon the probability for the window and a bias; and, e) 
repeating steps a) through d) for each remaining nucleotide in the nucleic acid sequence. 
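
By way of illustration only, the windowed procedure above can be sketched as follows. The sketch assumes that the window of a nucleotide is centered on that nucleotide and truncated at the ends of the sequence, that each state has a first-order Markov model, and that the bias acts as a prior weight; the names window_log_prob and per_nucleotide_posteriors, the half_width parameter, and the toy tables are hypothetical.

```python
import math

BASES = "ACGT"
UNIFORM = {b: 0.25 for b in BASES}
GC_RICH = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}
AT_RICH = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}

# Hypothetical toy models, one per state.
MODELS = {
    "coding": {"initial": dict(UNIFORM),
               "transition": {b: dict(GC_RICH) for b in BASES}},
    "noncoding": {"initial": dict(UNIFORM),
                  "transition": {b: dict(AT_RICH) for b in BASES}},
}
BIAS = {"coding": 0.5, "noncoding": 0.5}


def window_log_prob(window, initial, transition):
    """Steps a) to c) for one window and one state."""
    logp = math.log(initial[window[0]])
    for prev, cur in zip(window, window[1:]):
        logp += math.log(transition[prev][cur])
    return logp


def per_nucleotide_posteriors(seq, models, bias, half_width=3):
    """Steps d) and e): one normalized state distribution per nucleotide."""
    results = []
    for i in range(len(seq)):
        # Assumed window: centered on nucleotide i, clipped at the sequence ends.
        window = seq[max(0, i - half_width):i + half_width + 1]
        log_joint = {
            s: window_log_prob(window, m["initial"], m["transition"])
            + math.log(bias[s])
            for s, m in models.items()
        }
        peak = max(log_joint.values())
        weights = {s: math.exp(v - peak) for s, v in log_joint.items()}
        total = sum(weights.values())
        results.append({s: w / total for s, w in weights.items()})
    return results


seq = "GCGCGCGC" + "ATATATAT"
profile = per_nucleotide_posteriors(seq, MODELS, BIAS)
```

The result is a per-nucleotide probability profile of the kind consumed by the strand, open reading frame, insertion and deletion, and exon methods described below.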

The present invention includes and provides a method for determining strand coding of a 
nucleic acid sequence based upon a bias, comprising: a) determining a probability of each of one 
or more states for each nucleotide in the nucleic acid sequence, wherein each of the states is 
either a positive strand state or a negative strand state; b) summing the probabilities of the 
positive strand states for each of the nucleotides to produce a sum of probabilities for positive 
states; c) summing the probabilities of the negative strand states for each of the nucleotides to 
produce a sum of probabilities for negative states; and, d) deciding one of i) coding is mixed or 
not detectable if a first function of the sum of probabilities for positive states and the sum of 
probabilities for negative states is less than a threshold value; ii) coding is on the positive strand 
if a second function of the sum of probabilities for positive states is greater than a third function 
of the sum of probabilities for negative states and the first function is not less than the threshold 
value; and iii) coding is on the negative strand if the second function of the sum of probabilities 
for positive states is not greater than the third function of the sum of probabilities for negative 
states and the first function is not less than the threshold value. 
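
By way of illustration only, the three-way decision above can be sketched as follows. The passage does not fix the first, second, or third functions, so the sketch assumes the first function is the absolute difference of the two sums divided by the sequence length and the second and third functions are the identity; these choices, the name decide_strand, the state labels, and the threshold value are hypothetical.

```python
def decide_strand(state_probs, positive_states, negative_states, threshold=0.2):
    """Return 'positive', 'negative', or 'mixed or not detectable'.

    state_probs is a list with one dict of state probabilities per nucleotide.
    """
    if not state_probs:
        return "mixed or not detectable"
    # Steps b) and c): sum the strand-specific state probabilities.
    sum_pos = sum(row[s] for row in state_probs for s in positive_states)
    sum_neg = sum(row[s] for row in state_probs for s in negative_states)
    # Step d) i): assumed first function, the mean per-nucleotide difference.
    first = abs(sum_pos - sum_neg) / len(state_probs)
    if first < threshold:
        return "mixed or not detectable"
    # Step d) ii) and iii): assumed identity second and third functions.
    return "positive" if sum_pos > sum_neg else "negative"


plus_rows = [{"+": 0.7, "-": 0.1, "non": 0.2}] * 10
minus_rows = [{"+": 0.1, "-": 0.7, "non": 0.2}] * 10
mixed_rows = [{"+": 0.4, "-": 0.4, "non": 0.2}] * 10

plus_call = decide_strand(plus_rows, ["+"], ["-"])
minus_call = decide_strand(minus_rows, ["+"], ["-"])
mixed_call = decide_strand(mixed_rows, ["+"], ["-"])
```

With this assumed first function, a sequence whose two sums are both small and a sequence whose two sums are similar both fall under the "mixed or not detectable" outcome.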

The present invention includes and provides a method for determining the extent of an 
open reading frame within a nucleic acid sequence based upon a bias, comprising: a) determining 
the probability of each of one or more states for each nucleotide in the nucleic acid sequence, 
wherein each of the states is either a coding state or a noncoding state; b) determining the coding 
strand of the nucleic acid sequence; and, c) determining the points within the nucleic acid 
sequence in the coding strand at which the sum of the probabilities of the coding states for each 
nucleotide drops below a first threshold value for a number of nucleotides greater than a second 
threshold value, wherein ends of the open reading frame are indicated at the points. 
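
By way of illustration only, step c) above can be sketched as follows. The sketch assumes the per-nucleotide sums of coding-state probabilities on the coding strand have already been computed, and it reports the stretches that lie between qualifying low-probability runs as candidate open reading frame extents; the names low_runs and orf_extents, the half-open index convention, and the threshold values are hypothetical.

```python
def low_runs(coding_sums, prob_threshold, run_threshold):
    """Half-open (start, end) runs where the coding probability stays below
    prob_threshold for more than run_threshold consecutive nucleotides."""
    runs = []
    start = None
    for i, value in enumerate(coding_sums):
        if value < prob_threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start > run_threshold:
                runs.append((start, i))
            start = None
    if start is not None and len(coding_sums) - start > run_threshold:
        runs.append((start, len(coding_sums)))
    return runs


def orf_extents(coding_sums, prob_threshold=0.5, run_threshold=2):
    """Stretches delimited by qualifying low runs; their ends mark the ORF ends."""
    extents = []
    cursor = 0
    for start, end in low_runs(coding_sums, prob_threshold, run_threshold):
        if start > cursor:
            extents.append((cursor, start))
        cursor = end
    if cursor < len(coding_sums):
        extents.append((cursor, len(coding_sums)))
    return extents


# The single dip at index 5 is shorter than the run threshold and is ignored.
sums = [0.1, 0.1, 0.1, 0.9, 0.9, 0.4, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
extents = orf_extents(sums)
```

Requiring the drop to persist for more than a set number of nucleotides keeps a brief dip in coding probability from splitting one open reading frame into two.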

The present invention includes and provides a method for determining the location of 
insertions and deletions within a nucleic acid sequence, comprising: a) determining the 
probability of each of one or more states for each nucleotide in the nucleic acid sequence based 
upon a bias, wherein each of the states is either a coding state or a noncoding state; b) setting a 
length for a window; c) determining which state has a maximum mean probability for the nucleic 
acid sequence on a first side of a middle nucleotide in the window, wherein the window begins at 
a first nucleotide; d) determining which state has a maximum mean probability for the nucleic 
acid sequence on a second side of the middle nucleotide in the window; e) determining that a 
deletion or insertion occurred at the middle nucleotide if i) the state with the maximum mean 
probability on the first side of the middle nucleotide is different from the state with the maximum 
mean probability on the second side of middle nucleotide, and ii) either an average of 
hypothetical state probabilities for the window with an insertion at the middle nucleotide or an 
average of hypothetical state probabilities for the window with a deletion at the middle 
nucleotide is greater than a sum of the middle nucleotide's coding states probabilities; and, f) 
repeating steps c) through e) for each remaining nucleotide in the nucleic acid sequence after the 
first nucleotide, wherein the window begins at each remaining nucleotide in turn. 
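
By way of illustration only, steps b) through f) above can be sketched as follows. Computing the hypothetical state probabilities for a window with an insertion or a deletion requires re-scoring an edited sequence, which is outside the scope of this sketch; that computation is therefore represented by a caller-supplied function, hypothetical_means, which is a stand-in assumed here and not part of the specification. The names best_state and detect_indels, the state labels, and the toy values are likewise hypothetical.

```python
def best_state(rows):
    """State with the highest mean probability over a list of per-nucleotide rows."""
    return max(rows[0], key=lambda s: sum(r[s] for r in rows) / len(rows))


def detect_indels(state_probs, coding_states, window_len, hypothetical_means):
    """Return indices of middle nucleotides flagged as insertion or deletion sites.

    hypothetical_means(start, window_len) must return a pair:
    (mean with an insertion at the middle, mean with a deletion at the middle).
    """
    if window_len < 3:
        raise ValueError("window_len must leave a nucleotide on each side of the middle")
    hits = []
    half = window_len // 2
    # Step f): slide the window so that it begins at each nucleotide in turn.
    for start in range(len(state_probs) - window_len + 1):
        mid = start + half
        first_side = state_probs[start:mid]
        second_side = state_probs[mid + 1:start + window_len]
        # Steps c), d) and e) i): the best state must change across the middle.
        if best_state(first_side) == best_state(second_side):
            continue
        # Step e) ii): compare the hypothetical means with the middle's coding sum.
        ins_mean, del_mean = hypothetical_means(start, window_len)
        mid_coding = sum(state_probs[mid][s] for s in coding_states)
        if ins_mean > mid_coding or del_mean > mid_coding:
            hits.append(mid)
    return hits


# Toy profile: frame f1 is favored, then a dip, then frame f2 is favored.
rows = ([{"f1": 0.8, "f2": 0.1, "non": 0.1}] * 5
        + [{"f1": 0.3, "f2": 0.3, "non": 0.4}] * 2
        + [{"f1": 0.1, "f2": 0.8, "non": 0.1}] * 5)
hits = detect_indels(rows, ["f1", "f2"], 5, lambda start, length: (0.85, 0.2))
```

In this toy profile two adjacent middle nucleotides are flagged; an implementation would likely merge neighboring hits into a single candidate site, a refinement the sketch omits.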

The present invention includes and provides a method for determining exon location 
within a nucleic acid sequence, comprising: a) determining the probability of each of one or more 
states for each nucleotide in the nucleic acid sequence based upon a bias, wherein each of the 
states is either a coding state or noncoding state; b) determining the coding strand of the nucleic 
acid sequence; c) determining the extent of an open reading frame within the nucleic acid 
sequence; d) classifying each nucleotide in a coding class or a noncoding class based on a most 
probable state for the coding strand; e) reclassifying each nucleotide according to defined rules; 
and, f) determining that regions of the nucleic acid sequence in the coding class are exons. 
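
By way of illustration only, steps d) through f) above can be sketched as follows. The passage does not state the defined rules used for reclassification, so the sketch assumes a single rule: an interior run shorter than a minimum length takes the class of its neighbors. That rule, the names classify, runs, reclassify, and exons, and the toy values are hypothetical.

```python
def classify(state_probs, coding_states):
    """Step d): label each nucleotide 'C' or 'N' by its most probable state."""
    return ["C" if max(row, key=row.get) in coding_states else "N"
            for row in state_probs]


def runs(labels):
    """Half-open (label, start, end) runs of identical labels."""
    out = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            out.append((labels[start], start, i))
            start = i
    return out


def reclassify(labels, min_run=2):
    """Step e), assumed rule: flip interior runs shorter than min_run."""
    labels = list(labels)
    all_runs = runs(labels)
    for k, (label, start, end) in enumerate(all_runs):
        interior = 0 < k < len(all_runs) - 1
        if interior and end - start < min_run:
            flipped = "N" if label == "C" else "C"
            labels[start:end] = [flipped] * (end - start)
    return labels


def exons(labels):
    """Step f): regions left in the coding class are reported as exons."""
    return [(start, end) for label, start, end in runs(labels) if label == "C"]


pattern = "NNCCCCNCCCCNNN"
profile = [{"coding": 0.8, "noncoding": 0.2} if ch == "C"
           else {"coding": 0.2, "noncoding": 0.8} for ch in pattern]
raw = classify(profile, {"coding"})
smoothed = reclassify(raw, min_run=2)
found = exons(smoothed)
```

In the toy profile the isolated noncoding nucleotide at index 6 is absorbed into the surrounding coding region, so a single exon is reported instead of two.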

The present invention includes and provides a program storage device readable by a 
machine, tangibly embodying a program of instructions executable by a machine to perform 
method steps to determine a probability for each of one or more states for a nucleotide in a 
nucleic acid sequence, the method steps comprising: a) determining an initial oligonucleotide 
probability for each of the states for an initial oligonucleotide in the nucleic acid sequence; b) 
determining transition probabilities for each of the states for nucleotides within the nucleic acid 
sequence following the initial oligonucleotide; c) determining a probability for the nucleic acid 
sequence for each of the states; and, d) determining a probability for each of the states for the 
nucleotide based upon the probability of the nucleic acid sequence and a bias. 

The present invention includes and provides a program storage device readable by a 
machine, tangibly embodying a program of instructions executable by a machine to perform 
method steps to determine a probability for one or more states for more than one nucleotide in a 
nucleic acid sequence, the method steps comprising: a) determining an initial oligonucleotide 
probability for each of the states for an initial oligonucleotide in a window of a first nucleotide; 
b) determining transition probabilities for each of the states for nucleotides within the window 
following the initial oligonucleotide; c) determining a probability for the window for each of the 
states; d) determining a probability for each of the states for the nucleotide based upon the 
probability for the window and a bias; and, e) repeating steps a) through d) for each remaining 
nucleotide in the nucleic acid sequence. 

The present invention includes and provides a program storage device readable by a 
machine, tangibly embodying a program of instructions executable by a machine to perform 
method steps to determine strand coding of a nucleic acid sequence, the method steps 
comprising: a) determining a probability of each of one or more states for each nucleotide in the 
nucleic acid sequence based upon a bias, wherein each of the states is either a positive strand 
state or a negative strand state; b) summing the probabilities of the positive strand states for each 
of the nucleotides to produce a sum of probabilities for positive states; c) summing the 
probabilities of the negative strand states for each of the nucleotides to produce a sum of 
probabilities for negative states; and, d) deciding one of i) coding is mixed or not detectable if a 
first function of the sum of probabilities for positive states and the sum of probabilities for 
negative states is less than a threshold value; ii) coding is on the positive strand if a second 
function of the sum of probabilities for positive states is greater than a third function of the sum 
of probabilities for negative states and the first function is not less than the threshold value; and 
iii) coding is on the negative strand if the second function of the sum of probabilities for positive 
states is not greater than the third function of the sum of probabilities for negative states and the 
first function is not less than the threshold value. 

The present invention includes and provides a program storage device readable by a 
machine, tangibly embodying a program of instructions executable by a machine to perform 
method steps to determine the extent of an open reading frame within a nucleic acid sequence, 
the method steps comprising: a) determining the probability of each of one or more states for 
each nucleotide in the nucleic acid sequence based upon a bias, wherein each of the states is 
either a coding state or a noncoding state; b) determining the coding strand of the nucleic acid 
sequence; and, c) determining the points within the nucleic acid sequence in the coding strand at 
which the sum of the probabilities of the coding states for each nucleotide drops below a first 
threshold value for a number of nucleotides greater than a second threshold value, wherein ends 
of the open reading frame are indicated at the points. 

The present invention includes and provides a program storage device readable by a 
machine, tangibly embodying a program of instructions executable by a machine to perform 
method steps to determine the location of insertions and deletions within a nucleic acid sequence, 
the method steps comprising: a) determining the probability of each of one or more states for 
each nucleotide in the nucleic acid sequence based upon a bias, wherein each of the states is 
either a coding state or a noncoding state; b) setting a length for a window; c) determining which 
state has a maximum mean probability for the nucleic acid sequence on a first side of a middle 
nucleotide in the window, wherein the window begins at a first nucleotide; d) determining which 
state has a maximum mean probability for the nucleic acid sequence on a second side of the 
middle nucleotide in the window; e) determining that a deletion or insertion occurred at the 
middle nucleotide if i) the state with the maximum mean probability on the first side of the 
middle nucleotide is different from the state with the maximum mean probability on the second 
side of middle nucleotide, and ii) either an average of hypothetical state probabilities for the 
window with an insertion at the middle nucleotide or an average of hypothetical state 
probabilities for the window with a deletion at the middle nucleotide is greater than a sum of the 
middle nucleotide's coding states probabilities; and, f) repeating steps c) through e) for each 
remaining nucleotide in the nucleic acid sequence after the first nucleotide, wherein the window 
begins at each remaining nucleotide in turn. 

The present invention includes and provides a program storage device readable by a 
machine, tangibly embodying a program of instructions executable by a machine to perform 
method steps to determine exon location within a nucleic acid sequence, the method steps 
comprising: a) determining the probability of each of one or more states for each nucleotide in 
the nucleic acid sequence based upon a bias, wherein each of the states is either a coding state or 
noncoding state; b) determining the coding strand of the nucleic acid sequence; c) determining 
the extent of an open reading frame within the nucleic acid sequence; d) classifying each 
nucleotide in a coding class or a noncoding class based on a most probable state for the coding 
strand; e) reclassifying each nucleotide according to defined rules; and, f) determining that 
regions of the nucleic acid sequence in the coding class are exons. 

The present invention includes and provides a computer system for determining a 
probability for each of one or more states for a nucleotide in a nucleic acid sequence, comprising: 
an input device for inputting the nucleic acid sequence; a memory for storing the nucleic acid 
sequence; a processing unit configured for retrieving the nucleic acid sequence and for: a) 
determining an initial oligonucleotide probability for each of the states for an initial 
oligonucleotide in the nucleic acid sequence; b) determining transition probabilities for each of 
the states for nucleotides within the nucleic acid sequence following the initial oligonucleotide; 
c) determining a probability for the nucleic acid sequence for each of the states; and, d) 
determining a probability for each of the states for the nucleotide based upon the probability of 
the nucleic acid sequence and a bias. 

The present invention includes and provides a computer system for determining a 
probability for each of one or more states for more than one nucleotide in a nucleic acid 
sequence, comprising: an input device for inputting the nucleic acid sequence; a memory for 
storing the nucleic acid sequence; a processing unit configured for retrieving the nucleic acid 
sequence and for: a) determining an initial oligonucleotide probability for each of the states for 
an initial oligonucleotide in a window of a first nucleotide; b) determining transition probabilities 
for each of the states for nucleotides within the window following the initial oligonucleotide; c) 
determining a probability for the window for each of the states; d) determining a probability for 
each of the states for the nucleotide based upon the probability for the window and a bias; and, e) 
repeating steps a) through d) for each remaining nucleotide in the nucleic acid sequence. 

The present invention includes and provides a computer system for determining strand 
coding of a nucleic acid sequence, comprising: an input device for inputting the nucleic acid 
sequence; a memory for storing the nucleic acid sequence; a processing unit configured for 
retrieving the nucleic acid sequence and for: a) determining a probability of each of one or more 
states for each nucleotide in the nucleic acid sequence based upon a bias, wherein each of the 
states is either a positive strand state or a negative strand state; b) summing the probabilities of 
the positive strand states for each of the nucleotides to produce a sum of probabilities for positive 
states; c) summing the probabilities of the negative strand states for each of the nucleotides to 
produce a sum of probabilities for negative states; and, d) deciding one of i) coding is mixed or 
not detectable if a first function of the sum of probabilities for positive states and the sum of 
probabilities for negative states is less than a threshold value; ii) coding is on the positive strand 
if a second function of the sum of probabilities for positive states is greater than a third function 
of the sum of probabilities for negative states and the first function is not less than the threshold 
value; and iii) coding is on the negative strand if the second function of the sum of probabilities 
for positive states is not greater than the third function of the sum of probabilities for negative 
states and the first function is not less than the threshold value. 

The present invention includes and provides a computer system for determining the 
extent of an open reading frame within a nucleic acid sequence, comprising: an input device for 
inputting a nucleic acid sequence; a memory for storing the nucleic acid sequence; a processing 
unit configured for retrieving the nucleic acid sequence and for: a) determining the probability of 
each of one or more states for each nucleotide in the nucleic acid sequence based upon a bias, 
wherein each of the states is either a coding state or a noncoding state; b) determining the coding 
strand of the nucleic acid sequence; and, c) determining the points within the nucleic acid 
sequence in the coding strand at which the sum of the probabilities of the coding states for each 
nucleotide drops below a first threshold value for a number of nucleotides greater than a second 
threshold value, wherein ends of the open reading frame are indicated at the points. 

The present invention includes and provides a computer system for determining the 
location of insertions and deletions within a nucleic acid sequence, comprising: an input device 
for inputting a nucleic acid sequence; a memory for storing the nucleic acid sequence; a 
processing unit configured for retrieving the nucleic acid sequence and for: a) determining the 
probability of each of one or more states for each nucleotide in the nucleic acid sequence based 
upon a bias, wherein each of the states is either a coding state or a noncoding state; b) setting a 
length for a window; c) determining which state has a maximum mean probability for the nucleic 
acid sequence on a first side of a middle nucleotide in the window, wherein the window begins at 
a first nucleotide; d) determining which state has a maximum mean probability for the nucleic 
acid sequence on a second side of the middle nucleotide in the window; e) determining that a 
deletion or insertion occurred at the middle nucleotide if i) the state with the maximum mean 
probability on the first side of the middle nucleotide is different from the state with the maximum 
mean probability on the second side of middle nucleotide, and ii) either an average of 
hypothetical state probabilities for the window with an insertion at the middle nucleotide or an 
average of hypothetical state probabilities for the window with a deletion at the middle 
nucleotide is greater than a sum of the middle nucleotide's coding states probabilities; and, f) 
repeating steps c) through e) for each remaining nucleotide in the nucleic acid sequence after the 
first nucleotide, wherein the window begins at each remaining nucleotide in turn. 

The present invention includes and provides a computer system for determining exon 
location within a nucleic acid sequence, comprising: an input device for inputting a nucleic acid 
sequence; a memory for storing the nucleic acid sequence; a processing unit configured for 
retrieving the nucleic acid sequence and for: a) determining the probability of each of one or 
more states for each nucleotide in the nucleic acid sequence based upon a bias, wherein each of 
the states is either a coding state or noncoding state; b) determining the coding strand of the 
nucleic acid sequence; c) determining the extent of an open reading frame within the nucleic acid 
sequence; d) classifying each nucleotide in a coding class or a noncoding class based on a most 
probable state for the coding strand; e) reclassifying each nucleotide according to defined rules; 
and, f) determining that regions of the nucleic acid sequence in the coding class are exons. 

The present invention includes and provides a computer program product comprising a 
computer usable medium having computer program logic recorded thereon for enabling a 
processor in a computer system to determine a probability for each of one or more states for a 
nucleotide in a nucleic acid sequence, the computer program logic comprising means for 
enabling the processor to perform each of the following steps: a) determining an initial 
oligonucleotide probability for each of the states for an initial oligonucleotide in the nucleic acid 
sequence; b) determining transition probabilities for each of the states for nucleotides within the 
nucleic acid sequence following the initial oligonucleotide; c) determining a probability for the 
nucleic acid sequence for each of the states; and, d) determining a probability for each of the 
states for the nucleotide based upon the probability of the nucleic acid sequence and a bias. 

The present invention includes and provides a computer program product comprising a 
computer usable medium having computer program logic recorded thereon for enabling a 
processor in a computer system to determine a probability for each of one or more states for more 
than one nucleotide in a nucleic acid sequence, the computer program logic comprising means 
for enabling the processor to perform each of the following steps: a) determining an initial 
oligonucleotide probability for each of the states for an initial oligonucleotide in a window of a 
first nucleotide; b) determining transition probabilities for each of the states for nucleotides 
within the window following the initial oligonucleotide; c) determining a probability for the 
window for each of the states; d) determining a probability for each of the states for the 
nucleotide based upon the probability for the window and a bias; and, e) repeating steps a) 
through d) for each remaining nucleotide in the nucleic acid sequence. 

The present invention includes and provides a computer program product comprising a 
computer usable medium having computer program logic recorded thereon for enabling a 
processor in a computer system to determine strand coding of a nucleic acid sequence, the 
computer program logic comprising means for enabling the processor to perform each of the 
following steps: a) determining a probability of each of one or more states for each nucleotide in 
the nucleic acid sequence based upon a bias, wherein each of the states is either a positive strand 
state or a negative strand state; b) summing the probabilities of the positive strand states for each 
of the nucleotides to produce a sum of probabilities for positive states; c) summing the 
probabilities of the negative strand states for each of the nucleotides to produce a sum of 
probabilities for negative states; and, d) deciding one of i) coding is mixed or not detectable if a 
first function of the sum of probabilities for positive states and the sum of probabilities for 
negative states is less than a threshold value; ii) coding is on the positive strand if a second 
function of the sum of probabilities for positive states is greater than a third function of the sum 
of probabilities for negative states and the first function is not less than the threshold value; and 
iii) coding is on the negative strand if the second function of the sum of probabilities for positive 
states is not greater than the third function of the sum of probabilities for negative states and the 
first function is not less than the threshold value. 

The present invention includes and provides a computer program product comprising a 
computer usable medium having computer program logic recorded thereon for enabling a 
processor in a computer system to determine the extent of an open reading frame within a nucleic 
acid sequence, the computer program logic comprising means for enabling the processor to 
perform each of the following steps: a) determining the probability of each of one or more states 
for each nucleotide in the nucleic acid sequence based upon a bias, wherein each of the states is 
either a coding state or a noncoding state; b) determining the coding strand of the nucleic acid 
sequence; and, c) determining the points within the nucleic acid sequence in the coding strand at 
which the sum of the probabilities of the coding states for each nucleotide drops below a first 
threshold value for a number of nucleotides greater than a second threshold value, wherein ends 
of the open reading frame are indicated at the points. 
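
Step c) can be sketched, under stated assumptions, as a scan for sustained drops in the summed coding-state probability. In this Python sketch the input coding_sums is assumed to already hold, for each nucleotide of the coding strand, the sum of its coding-state probabilities, and each point is reported at the start of the low run; both choices, and the function name, are illustrative rather than taken from the source.

```python
def find_orf_end_points(coding_sums, prob_threshold, run_threshold):
    """Return 0-based start indices of runs where the coding-state sum stays
    below prob_threshold for more than run_threshold consecutive nucleotides."""
    points = []
    run_start = None
    for i, value in enumerate(coding_sums):
        if value < prob_threshold:
            if run_start is None:
                run_start = i  # a low run begins here
        else:
            if run_start is not None and i - run_start > run_threshold:
                points.append(run_start)
            run_start = None
    # Handle a low run that extends to the end of the sequence.
    if run_start is not None and len(coding_sums) - run_start > run_threshold:
        points.append(run_start)
    return points
```

Short dips (no longer than run_threshold nucleotides) are ignored, so isolated low-probability positions do not split an open reading frame.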

The present invention includes and provides a computer program product comprising a
computer usable medium having computer program logic recorded thereon for enabling a
processor in a computer system to determine the location of insertions and deletions within a
nucleic acid sequence, the computer program logic comprising means for enabling the processor
to perform each of the following steps: a) determining the probability of each of one or more
states for each nucleotide in the nucleic acid sequence based upon a bias, wherein each of the
states is either a coding state or a noncoding state; b) setting a length for a window; c)
determining which state has a maximum mean probability for the nucleic acid sequence on a first
side of a middle nucleotide in the window, wherein the window begins at a first nucleotide; d)
determining which state has a maximum mean probability for the nucleic acid sequence on a
second side of the middle nucleotide in the window; e) determining that a deletion or insertion
occurred at the middle nucleotide if i) the state with the maximum mean probability on the first
side of the middle nucleotide is different from the state with the maximum mean probability on
the second side of the middle nucleotide, and ii) either an average of hypothetical state probabilities
for the window with an insertion at the middle nucleotide or an average of hypothetical state
probabilities for the window with a deletion at the middle nucleotide is greater than a sum of the
middle nucleotide's coding states probabilities; and, f) repeating steps c) through e) for each
remaining nucleotide in the nucleic acid sequence after the first nucleotide, wherein the window
begins at each remaining nucleotide in turn.
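
Condition i) of step e) can be illustrated with the following Python sketch, which compares the state having the maximum mean probability on each side of the middle nucleotide. Condition ii) requires recomputing hypothetical state probabilities with an insertion or deletion and is not attempted here. The function names and the dict-per-nucleotide input layout are assumptions made for the example.

```python
def max_mean_state(state_probs):
    """Return the state with the highest mean probability over the positions."""
    totals = {}
    for probs in state_probs:
        for state, p in probs.items():
            totals[state] = totals.get(state, 0.0) + p
    # Every state is divided by the same number of positions, so comparing
    # totals gives the same winner as comparing means.
    return max(totals, key=totals.get)


def flanks_disagree(window_probs):
    """Condition i): do the two sides of the middle nucleotide favor
    different states? Expects a window of at least three nucleotides."""
    mid = len(window_probs) // 2
    left = window_probs[:mid]
    right = window_probs[mid + 1:]
    return max_mean_state(left) != max_mean_state(right)
```

A change from, for example, state 1+ on one side to state 2+ on the other is the kind of reading frame shift that an insertion or deletion would be expected to produce.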

The present invention includes and provides a computer program product comprising a
computer usable medium having computer program logic recorded thereon for enabling a
processor in a computer system to determine exon location within a nucleic acid sequence, the
computer program logic comprising means for enabling the processor to perform each of the
following steps: a) determining the probability of each of one or more states for each nucleotide
in the nucleic acid sequence based upon a bias, wherein each of the states is either a coding state
or noncoding state; b) determining the coding strand of the nucleic acid sequence; c) determining
the extent of an open reading frame within the nucleic acid sequence; d) classifying each
nucleotide in a coding class or a noncoding class based on a most probable state for the coding
strand; e) reclassifying each nucleotide according to defined rules; and, f) determining that
regions of the nucleic acid sequence in the coding class are exons.
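
Steps d) and f) can be sketched in Python as follows. The sketch assumes that each per-nucleotide dict has already been restricted to the states of the coding strand, and it omits the reclassification of step e), whose rules are not given here. The function name and the 0-based inclusive coordinates are illustrative choices.

```python
CODING = {"1+", "2+", "3+", "1-", "2-", "3-"}


def exon_regions(state_probs):
    """Return (start, end) 0-based inclusive pairs of contiguous nucleotides
    whose most probable state is a coding state."""
    regions = []
    start = None
    for i, probs in enumerate(state_probs):
        is_coding = max(probs, key=probs.get) in CODING
        if is_coding and start is None:
            start = i  # a coding-class run begins
        elif not is_coding and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(state_probs) - 1))
    return regions
```

A filtering step, where used, would be applied to the per-nucleotide classes before the regions are collected.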

The present invention includes and provides a method for determining a probability for
one or more states for a nucleotide in a nucleic acid sequence, comprising determining a
probability for each of the states for the nucleotide based upon a probability of the nucleic acid
sequence and a bias.

The present invention includes and provides a method for determining a probability for 
each of one or more states for more than one nucleotide in a nucleic acid sequence comprising: a) 
determining a probability for each of the states for a first nucleotide in the nucleic acid sequence 
based upon a probability of a window in which the first nucleotide is located and a bias; and, b) 
repeating step a) for the remaining nucleotides in the nucleic acid sequence. 



Description Of The Figures 

Figure 1 is a flow chart representing one embodiment of a method for determining the 
probability of each of the possible states for a single nucleotide in a nucleic acid sequence; 

Figure 2 is a flow chart representing one embodiment of a method for determining the
probability of each of the possible states for multiple nucleotides in a nucleic acid sequence;

Figure 3 is a flow chart representing one embodiment of a method for determining the
coding strand of a nucleic acid sequence;

Figure 4 is a flow chart representing one embodiment of a method for determining the
extent of an open reading frame within a nucleic acid sequence;

Figure 5 is a flow chart representing one embodiment of a method for determining the
location of insertions and deletions within a nucleic acid sequence;

Figure 6 is a flow chart representing one embodiment of a method for determining the
extent of exons within a nucleic acid sequence and the protein translation of those exons;

Figure 7 is a flow chart representing one embodiment of a method for determining the 
extent of exons within a nucleic acid sequence and the protein translation of those exons; 

Figure 8a is a schematic representation of a window located at the end of a nucleic acid
sequence;

Figure 8b is a schematic representation of a window located at the end of a nucleic acid 
sequence showing nucleotides near the end of the nucleic acid sequence; 

Figure 8c is a schematic representation showing the ends of a nucleic acid sequence being 
copied to form a hypothetical extension on each end of the nucleic acid sequence;

Figure 8d is a schematic representation of a nucleic acid sequence showing the appended
hypothetical extensions; 

Figure 9a is a schematic representation of one embodiment of a computer system that can 
implement the methods of the present invention; 

Figure 9b is a schematic representation of one embodiment of a computer system that can
implement the methods of the present invention;

Figure 10a is a schematic representation of a genomic sequence of DNA with an
expressed sequence tag aligned thereto;

Figure 10b is a schematic representation of a window in a region of DNA when the entire 
region is in a known coding region; and,

Figure 10c is a schematic representation of a window in a region of DNA when part of
the region is known to be coding, and part of the region is known to be noncoding.

Detailed Description Of The Invention

Described herein are methods for determining the state probabilities of one or more
nucleotides in a nucleic acid sequence, the coding strand of a nucleic acid sequence, the extent of
an open reading frame in a nucleic acid sequence, the location of deletions and insertions in a
nucleic acid sequence, the location of exons in a nucleic acid sequence, and the translation of
those exons. Also described are program storage devices readable by a machine, tangibly
embodying a program of instructions executable by a machine to perform the above methods.
Also described are computer systems for implementing the above methods, comprising an input
device for inputting a nucleic acid sequence, a memory for storing the nucleic acid sequence,
and a processing unit. Also described are computer program products comprising a computer
usable medium having computer program logic recorded thereon for enabling a processor in a
computer system to perform the above methods.

Definitions: 

Nucleic Acid Sequence - As used herein, "nucleic acid sequence" includes a nucleic acid 
sequence of any nucleic acid as is generally understood in the art. The nucleic acid can be DNA, 
cDNA, genomic DNA, raw DNA, expressed nucleic acid sequence tags (ESTs), RNA, mRNA,
unprocessed RNA, processed RNA, or any other form of nucleic acid, regardless of whether or
not the nucleic acid actually codes for a protein. 

Nucleic acid sequences can be derived from any natural or artificial source, including 
prokaryotic and eukaryotic organisms, and can be at any stage of processing.

It is understood by those skilled in the art that any representation of a nucleic acid 
sequence is contemplated herein and within the scope of the present invention. That is, while 
conventionally nucleic acid sequences are represented by the nucleotide or base letters A, T, G,
C, U, any alphanumeric or other representation of nucleotide or base nucleic acid sequence,
whether digitally represented or otherwise, is within the scope of this invention. Further, nucleic
acid sequence notation indicating uncertainty with respect to the identification of one or more
bases in a nucleic acid sequence, for example IUB nomenclature such as R=G or A, Y=T or
C, etc., can be incorporated into the method described herein and is within the scope of this 
invention. 

Nucleic acid sequences having modified or non-standard bases can be incorporated into 
the method described herein and are within the scope of this invention. For the purposes of this 
invention, a nucleic acid sequence of "bases" is an equivalent nucleic acid sequence to the 
nucleic acid sequence in which the bases are found. 

Reading frame - A "reading frame" is one of the possible phases in which one can read a 
sequence of codons (groups of three nucleotides) that can make up a coding region of DNA or 
RNA. In a codon the positions in 5' to 3' order are called the "first", "second", and "third" 
reading frames. 



States - The "states" attributable to a nucleotide are the potential permutations of all of the
possible reading frames and the two nucleic acid strands included in the probability model being
used. A "+" is used to indicate the positive strand, and a "-" to indicate the reverse complement
DNA strand. In a preferred embodiment, the possible states of any one nucleotide are positive
strand first reading frame (1+), positive strand second reading frame (2+), positive strand third
reading frame (3+), negative strand first reading frame (1-), negative strand second reading frame
(2-), negative strand third reading frame (3-), positive strand noncoding (N+), and negative
strand noncoding (N-). In another embodiment, the states can be, for example, just the four
positive states listed above. Stated symbolically, "f" is an element in the set of states, i.e. f ∈
{1+,2+,3+,N+,1-,2-,3-,N-}.

Coding State - A "coding state" is any of the states 1+, 2+, 3+, 1-, 2-, or 3-, which indicate
coding, i.e. nucleic acids translated into protein. 

Noncoding state - A "noncoding state" is either of the states N- or N+, both of which indicate 
noncoding, i.e. no protein translation. 






Sequentially - "Sequentially" means performing a step or series of steps on nucleotides in order 
as the nucleotides occur in the nucleic acid sequence, in either direction. 

State probabilities - The "state probabilities" of a nucleotide within a nucleic acid sequence are a 
vector of probabilities associated with the given nucleotide being in each of the states. 

Window - A "window" is a contiguous and defined number of nucleotides within a nucleic acid 
sequence. For example, in a nucleic acid sequence having a length of several thousand 
nucleotides, a window of, again for example, 100 nucleotides can be defined for specific analysis 
at any place within the larger nucleic acid sequence. 

Middle Nucleotide - The "middle nucleotide" in any given nucleic acid sequence or window is 
the nucleotide found at the numerical middle of the nucleic acid sequence or window, 
respectively, wherein the length of a nucleic acid sequence or window is the total number of 
nucleotides in the nucleic acid sequence or window. If the nucleic acid sequence or window has 
an even number of nucleotides, then the middle nucleotide can be either of the two nucleotides 
adjacent to the numerical middle of the nucleic acid sequence or window. For example, the middle
nucleotide in a 101 nucleotide long window is nucleotide number 51, and the middle nucleotide 
in a 100 nucleotide long window can be either nucleotide number 50 or nucleotide number 51.

Oligonucleotide - An "oligonucleotide" is a series of contiguous nucleotides with a defined
length.



Initial Oligonucleotide - The "initial oligonucleotide" is the oligonucleotide that occurs at the 
beginning of the nucleic acid sequence or window being examined. Therefore, the first 
nucleotide in the initial oligonucleotide is also the first nucleotide in the sequence or window. 

Transition Probability - A "transition probability" for a given nucleotide is the probability of the 
nucleotide occurring given the oligonucleotide immediately preceding that nucleotide. 

Bias Function - The "Bias Function" is a function that is used to differentially alter the
probability of one or more states of one or more nucleotides in a nucleic acid sequence. For
example, if a region of the nucleic acid sequence under study is thought to be a coding region,
then the bias function can be used to increase the calculated probability of the coding states for
that nucleic acid sequence.

Bias - "Bias" is a set of one or more values that are used in the Bias Function, and is used to alter 
the probability of one or more states of one or more nucleotides in a nucleic acid sequence.

Filter - A "filter" as used herein is any method or algorithm for unifying and making more 
homogeneous regions of a nucleic acid sequence that have been classified in disparate states. A
filter is used for the purpose of more clearly defining coding region boundaries in a nucleic acid
sequence. In a method, a step in which a filter is applied is a "filtering step." 

Class - A "class" of nucleotides is a group of nucleotides that are designated as having one state 
for the purposes of filtering. 

Positive Strand and Negative Strand - The terms "positive strand (+)" and "negative strand (-)"
represent complementary nucleic acid sequences. The sequence in one strand is defined by the
sequence in the complementary strand.

Positive Strand State - A "positive strand state" is any of states 1+, 2+, 3+, N+.

Negative Strand State - A "negative strand state" is any of states 1-, 2-, 3-, N-.
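
For illustration, the eight states of the preferred embodiment and the groupings defined above can be written down as simple Python constants. The constant and function names are hypothetical; only the state labels themselves come from the definitions.

```python
READING_FRAMES = ("1", "2", "3")
STRANDS = ("+", "-")

# 1+, 2+, 3+, 1-, 2-, 3-
CODING_STATES = tuple(frame + strand for strand in STRANDS for frame in READING_FRAMES)
# N+, N-
NONCODING_STATES = tuple("N" + strand for strand in STRANDS)
ALL_STATES = CODING_STATES + NONCODING_STATES


def is_positive_strand_state(state):
    """True for 1+, 2+, 3+, and N+."""
    return state.endswith("+")


def is_coding_state(state):
    """True for the six reading-frame states, false for N+ and N-."""
    return state in CODING_STATES
```

A model with fewer states, such as the four positive states alone, would simply use a subset of ALL_STATES.
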
Description 

The methods described herein can be performed in any manner that allows for the
analysis of the nucleic acid sequence under study and computation of the probabilities associated
with that nucleic acid sequence. In a preferred embodiment, the physical nucleic acid sequence,
for example a DNA sequence having a contiguous nucleic acid sequence of G, C, T, and A
nucleotides, is converted into digital form by, for example, inputting the nucleic acid sequence
into a computer system. The computer then processes the nucleic acid sequence using the
methods described herein. Any nucleic acid sequence referred to herein can be arranged to have
a beginning and an end, and numbered so that the first nucleotide in the nucleic acid sequence is
number 1, the next nucleotide in the nucleic acid sequence is number 2, and so on until the end of
the nucleic acid sequence. Any other numbering scheme that is useful can be used.

The methods shown in Figures 1-7 are independent, and, although several of the methods 
described can be utilized together, they can each be performed as independent methods. Further, 
where one method calls for a step in which one of the other methods can be used for that step, the 
use of the other method in the step represents only one embodiment, and other methods for 
performing the step can be used as well. 

Any probability model applicable to nucleic acid sequence state probabilities can be used 
for the probability steps if the output of the probability model sufficiently supports the method, 
including inhomogeneous Markov models that have fewer than eight states, for example, those
having only six or four states. In a preferred embodiment, the inhomogeneous Markov model
has eight states. (For a general discussion of various models, see Durbin, et al., Biological
Sequence Analysis (1998), which is herein incorporated by reference in its entirety). 

Any nucleic acid sequence source can be used, regardless of the accuracy of the nucleic 
acid sequence relative to the physical molecule it represents, including raw nucleic acid sequence 
data and nucleic acid sequence data that has been changed or adjusted for other purposes, such as
nucleic acid sequences that have been filtered to improve accuracy, nucleic acid sequences that
have been altered to account for known mutations, and nucleic acid sequences that have been 
engineered in any manner whatsoever, among others. Nucleic acid sequence information 
produced by automated nucleic acid sequencers can be used, as well as nucleic acid sequence 
information derived by any conventional sequencing technique, such as dideoxy sequencing, 
among others. Nucleic acid sequences produced by or from other bioinformatic processing 
methods or nucleic acid databases can be used, for example, including nucleic acid sequences 
stored in public access databases such as GenBank. Although nucleic acid sequences with any 
amount of error can be used, in a preferred embodiment the amount of sequencing error present 
is less than about 15%, and more preferably is less than about 10%. However, an advantage of
the methods of the present invention is that they can utilize lower quality nucleic acid sequences.
In this embodiment, the methods of the present invention can utilize nucleic acid sequences
where the average sequence accuracy is less than 99%, more preferably less than 95%, more
preferably less than 90%, 80%, or 70%.

The present invention includes the incorporation of bias into probability models that 
determine state probabilities for one or more nucleotides. The bias is used to alter the statistical 
probability of one or more states for a nucleotide. A bias of zero, for example, will reduce the 
probability of a state to zero, while a bias of one will not alter the statistical probability. Values 
greater than one will increase the statistical probability of a state, while values between zero and 
one will reduce the statistical probability of a state. Bias can be defined by the investigator in 
order to influence the probability of states. In a preferred embodiment, bias is defined to alter the 
probability of states in a manner consistent with existing knowledge of the nucleic acid sequence
under study. For example, if a nucleic acid sequence has a region that is strongly suspected to be 
coding, then the nucleotides in that region can be assigned a large bias for the coding states, and 
a small bias for the noncoding states. Bias can be incorporated into any conventional statistical 
model that provides a method for determining state probabilities in order to allow for the biasing 
of statistical probabilities in that model. In one embodiment, bias can be defined for each state as 
a number equal to or greater than zero, excluding one. In this embodiment, the statistical
probability of a state will be reduced if the bias is set to a number equal to or greater than zero
and less than one, and increased if the bias is set to a number greater than one, and all states are
biased in one direction or the other. In another embodiment, bias can be defined as one for one or
more states, and a number other than one for one or more states. In this embodiment, one or
more states have a defined bias of one, which results in no biasing of the probability of those states,
while one or more states have a defined value equal to or greater than zero, excluding one. In
this embodiment, one or more states are biased, and one or more states are not. In a preferred
embodiment, the bias is between 0.0 and 0.9 or greater than 1.1.

Figure 1 represents one embodiment of the method of the present invention for 
determining the state probabilities of a single nucleotide within a nucleic acid sequence. The 
nucleotide for which the state probabilities are determined can be any nucleotide in the nucleic 
acid sequence, preferably is a nucleotide close to the middle of the sequence, and in a preferred 
embodiment the nucleotide is the middle nucleotide in the nucleic acid sequence. It is preferable 
to determine state probabilities for a nucleotide at or near the middle of the nucleic acid 
sequence. State probabilities for the nucleotide are determined by first finding the probability of 
the initial oligonucleotide in the nucleic acid sequence, and then finding the transition 
probabilities for the remainder of the nucleotides in the nucleic acid sequence. The initial 
oligonucleotide probability and transition probability information is used to determine the 
probabilities of each of the states for the entire nucleic acid sequence, and the resulting state 
probabilities are assigned to the nucleotide. Eight states are described below for Figure 1, but
those of skill in the art will readily see that fewer than eight states can be employed. 

Referring now to Figure 1, in step 12, the probability that the initial oligonucleotide 
occurs in each of the states is determined according to equation I: 



(I) 



where "a_1...a_k" is an initial oligonucleotide of length k, a_1 is the first nucleotide in the
oligonucleotide, N_k is the set of all oligonucleotides occurring in the model sample set, and f is an
element of the set of states, which, in a preferred embodiment, is {1+,2+,3+,N+,1-,2-,3-,N-}.

The oligonucleotide length is predefined, and can be any length for which probabilities
can be reliably generated. Oligonucleotides can be, for example, from 2 to 100 nucleotides,
preferably 5 to 20 nucleotides, and more preferably from 8 to 12 nucleotides in length. The
initial oligonucleotide frequencies of all possible oligonucleotides in the model sample set can
be, for example, stored in a look up table, which is accessed as needed. A table defining the
model sample set can be constructed, for example, by reference to sample nucleic acid sequences 
from a previously examined collection of nucleic acids, preferably from a closely related 
organism, more preferably from the same organism as the nucleic acid sequence under 
investigation. For example, sample nucleic acid sequences from Arabidopsis can be used for a 
table for investigation of nucleic acid sequences of plants such as soybean, maize, etc. Similarly, 
sample nucleic acid sequences from a chimpanzee can be used for a table for investigation of 
nucleic acid sequences of humans. By examining known nucleic acid sequences, model 
oligonucleotide frequencies in each of the states can be determined. A table can include 
indefinite or modified nucleotides, or any other nucleotide variations that occur in nucleic acid 
sequences. Alternatively, it is also possible to use estimation functions in place of such a table of 
probabilities (see, for example, Besemer, J., Borodovsky, M. (1999) Nucl. Acids Res., v.27, pp.
3911-3920, which is herein incorporated by reference in its entirety).
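
One simple way to build such a look up table is to count oligonucleotides in the sample sequences and normalize the counts. The Python sketch below does this for a single state; it assumes, for the sake of the example, that the sample sequences passed in have already been selected for that state, and that a separate table would be built for each state. The function name oligo_frequencies is hypothetical.

```python
from collections import Counter


def oligo_frequencies(sample_sequences, k):
    """Estimate the relative frequency of every length-k oligonucleotide
    observed in the sample sequences (a simple look up table)."""
    counts = Counter()
    for seq in sample_sequences:
        # Slide over every start position that leaves room for k nucleotides.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {oligo: n / total for oligo, n in counts.items()}
```

Oligonucleotides absent from the sample set are simply missing from the table in this sketch; a practical implementation would need some policy, such as pseudocounts, for handling them.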

In step 14, the transition probabilities for all nucleotides in the nucleic acid sequence after 
the initial oligonucleotide in each of the states are determined. The transition probability is the
probability of a nucleotide occurring given the oligonucleotide immediately preceding the
nucleotide. The transition probability for the first nucleotide transition is set out in equation II: 



(II)

where k is the oligonucleotide length, a_1 is the first nucleotide in the oligonucleotide,
"a_1...a_k" is the initial oligonucleotide, a_{k+1} is the nucleotide immediately following a_k, and f ∈
{1+,2+,3+,N+,1-,2-,3-,N-}. Equation II determines the transition probability for the first
nucleotide following the initial oligonucleotide. After determining the transition probability for
the first nucleotide after the initial oligonucleotide, the transition probabilities are determined
sequentially for the remaining nucleotides in the nucleic acid sequence. This means that a
transition probability is determined for the second nucleotide following the initial oligonucleotide
based on the oligonucleotide beginning at the second position, a_2, and ending at a_{k+1}. The
process is repeated until the end of the nucleic acid sequence is reached. For example, if the
oligonucleotide length is ten, then a transition probability for nucleotide eleven is determined
based on the oligonucleotide comprising nucleotides one through ten. Then, a transition
probability for nucleotide twelve is determined based on the oligonucleotide comprising
nucleotides two through eleven, and so on, until the last nucleotide in the nucleic acid sequence.

The transition probabilities can be stored in a table, for example. The table can be
constructed, for example, by reference to sample nucleic acid sequences from a previously
examined collection of nucleic acid, preferably from a closely related organism, more preferably
from the same organism as the nucleic acid under investigation. By examining known nucleic
acid sequences, model transition probabilities in each of the states can be determined.
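
As a sketch of how such a table might be constructed, the following Python function estimates each transition probability as the number of times a nucleotide follows a given oligonucleotide divided by the number of times that oligonucleotide is followed by any nucleotide. This relative-count estimator is an assumption made for the example, as are the function name and the single-state simplification; the source does not prescribe a particular estimator.

```python
from collections import Counter


def transition_table(sample_sequences, k):
    """Estimate P(next nucleotide | preceding k-mer) by relative counts.

    Returns a dict mapping (k-mer, next nucleotide) -> probability.
    """
    context_counts = Counter()
    pair_counts = Counter()
    for seq in sample_sequences:
        # Each position i yields a k-mer context and the nucleotide after it.
        for i in range(len(seq) - k):
            context = seq[i:i + k]
            nxt = seq[i + k]
            context_counts[context] += 1
            pair_counts[(context, nxt)] += 1
    return {pair: n / context_counts[pair[0]] for pair, n in pair_counts.items()}
```

With an oligonucleotide length of ten, for example, the entry for a given ten-nucleotide context and the nucleotide "G" would give the transition probability used when "G" appears at position eleven after that context.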

In step 16, the probability of the nucleic acid sequence in each of the states is determined
by finding the product of the probability of the initial oligonucleotide and the
transition probabilities in each of the states. This step is set forth in equation III for a model
with eight states:

(III)

where the function

F(i) = \begin{cases}
  i \bmod 3 + 1 & \text{if } f = 1\pm \\
  (i + 1) \bmod 3 + 1 & \text{if } f = 2\pm \\
  (i + 2) \bmod 3 + 1 & \text{if } f = 3\pm
\end{cases}

and ω is the length of the nucleic acid sequence, and "a_1...a_k" is the initial oligonucleotide.
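
The product described for step 16 can be sketched in Python as shown below. Several points are assumptions rather than statements of the source: the sketch accumulates logarithms instead of multiplying probabilities directly (an implementation choice to avoid numerical underflow), it passes the 1-based nucleotide position to F, it uses a frame index of 0 for noncoding states, and the table layouts and function names are hypothetical. It is not a reconstruction of equation III.

```python
import math


def frame_index(i, state):
    """Codon-position index F(i) in {1, 2, 3} for a coding state such as '2+'."""
    offset = {"1": 0, "2": 1, "3": 2}[state[0]]
    return (i + offset) % 3 + 1


def log_sequence_probability(seq, k, state, initial_probs, transitions):
    """Log of the initial k-mer probability times the successive transitions.

    initial_probs: dict k-mer -> probability (hypothetical table for `state`).
    transitions: dict (F, k-mer, next nucleotide) -> probability, where F is
    the frame index for coding states and 0 for noncoding states.
    """
    logp = math.log(initial_probs[seq[:k]])
    for i in range(k, len(seq)):
        # i is 0-based, so the 1-based position of this nucleotide is i + 1.
        frame = 0 if state.startswith("N") else frame_index(i + 1, state)
        logp += math.log(transitions[(frame, seq[i - k:i], seq[i])])
    return logp
```

Selecting the transition table by F(i) is what makes the model inhomogeneous: the same oligonucleotide context can carry a different transition probability depending on the codon position at which it occurs.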
In step 18, the probability of each state for the nucleic acid sequence, "P(f|S)", is
determined given the probability of the nucleic acid sequence, S, in each state. A bias function,
φ(f), is incorporated into the equation to account for known nucleic acid sequence information.

This step is set forth in equation IV:

(IV)    P(f|S) = \frac{\phi(f) \cdot P_f \cdot P_f(S)}{\sum_{i \in \{1+,2+,3+,N+,1-,2-,3-,N-\}} \phi(i) \cdot P_i \cdot P_i(S)}

wherein P_f is 1 for each coding state (1+, 2+, 3+, 1-, 2-, 3-) and 1 for each noncoding
state (N+, N-). The bias function is used to modify these default values. By modifying the
default values, the investigator can account for known nucleic acid sequence features. For
example, if another bioinformatics process has indicated that there is a high probability that a
certain portion of a nucleic acid sequence comprises a gene, then it would be advantageous to
bias the state probabilities in favor of the coding states. The resulting state probabilities
produced by the method will reflect the bias through stronger probabilities of the coding states
relative to the noncoding states.
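
The normalization of equation IV can be illustrated with a short Python sketch. The function and argument names are hypothetical, and the numeric values in the usage note are invented purely to show the arithmetic.

```python
def posterior_state_probabilities(seq_probs, priors, bias):
    """Equation IV: P(f|S) = phi(f) P_f P_f(S) / sum_i phi(i) P_i P_i(S).

    seq_probs: state -> P_f(S), the probability of the sequence in that state.
    priors:    state -> P_f, the default value for that state.
    bias:      state -> phi(f), the bias applied to that state.
    """
    weighted = {f: bias[f] * priors[f] * seq_probs[f] for f in seq_probs}
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("all biased state probabilities are zero")
    return {f: w / total for f, w in weighted.items()}
```

For example, with two states of equal default value 0.5 and equal sequence probability 0.2, a bias of 3.0 on the coding state and 1.0 on the noncoding state yields state probabilities of 0.75 and 0.25 respectively, whereas a bias of 1.0 on both leaves them at 0.5 each.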

If, for example, the nucleic acid sequence is known to be a coding nucleic acid sequence, 
the bias function can be defined by equation V: 



(V)    \phi(f) = \begin{cases} 1 & \text{if } f \in \{1\pm, 2\pm, 3\pm\} \\ 0 & \text{if } f = N\pm \end{cases}

Equation V uses a bias of 1 for all coding states, and a bias of 0 for all noncoding states.
The net effect will be to cause the probability of the sequence in each noncoding state to drop to
zero, while leaving the probability of the sequence in the coding states unaffected. Application
of equation IV then leads to a decrease of the probabilities of the noncoding states to zero, while
increasing the probabilities of the coding states.

If the nucleic acid sequence is known to be a noncoding nucleic acid sequence, then the
bias function can be defined by equation VI:

(VI)    \phi(f) = \begin{cases} 0 & \text{if } f \in \{1\pm, 2\pm, 3\pm\} \\ 1 & \text{if } f = N\pm \end{cases}



Equation VI reverses the effect of equation V. Of course, the bias function does not need
to be binary in nature, as it is in the above two examples, but rather can be defined in any
manner that corresponds with known nucleic acid sequence data. A principal feature of this
technique is that it can be used to specifically combine gene prediction information from other
sources into biasing the results of the state probabilities algorithm shown in Figure 1 (and
subsequent gene prediction based thereon).
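
The binary bias functions of equations V and VI, and a non-binary alternative, can all be expressed with one small helper, sketched here in Python. The helper name make_bias and the graded values 2.0 and 0.5 are hypothetical and chosen only for illustration.

```python
CODING_STATES = ("1+", "2+", "3+", "1-", "2-", "3-")
NONCODING_STATES = ("N+", "N-")


def make_bias(coding_weight, noncoding_weight):
    """Return a bias value for every state: one weight shared by the coding
    states and another shared by the noncoding states."""
    bias = {f: float(coding_weight) for f in CODING_STATES}
    bias.update({f: float(noncoding_weight) for f in NONCODING_STATES})
    return bias


known_coding = make_bias(1, 0)        # equation V
known_noncoding = make_bias(0, 1)     # equation VI
suspected_coding = make_bias(2.0, 0.5)  # a graded, non-binary bias
```

A graded bias of this kind favors the coding states without excluding the noncoding states outright, which may be appropriate when the outside evidence is suggestive rather than conclusive.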

The resulting values for the probability of each state for the nucleic acid sequence can
now be associated with the nucleotide for which state probabilities were being determined.

In a further embodiment of the method shown in Figure 1, the nucleic acid sequence is
part of a larger nucleic acid sequence. This embodiment can be applied to any of the methods
described herein wherein a nucleic acid sequence is used, including those represented in Figures
1 through 7.

Figure 1 shows the determination of state probabilities for a single nucleotide in a nucleic 
acid sequence. Oftentimes, however, it will be desirable to determine the state probabilities for
more than one nucleotide in a nucleic acid sequence. 

Figure 2 represents the application of the method shown in Figure 1 to multiple 
nucleotides in a nucleic acid sequence. In order to determine the state probabilities for more than 
one nucleotide, a window is used for each nucleotide that is examined. The nucleotide that is 
being examined is within the window, and the probability determinations set out in equations I,
II, III, and IV are performed for the sequence in the window. The oligonucleotide probabilities
are determined as before for the nucleic acid sequence within the window. Probabilities for each
of the states are determined for the nucleic acid sequence within the window, and those
probabilities are assigned to the nucleotide within the window for which state probabilities are
being determined, which, in a preferred embodiment, is the middle nucleotide. Another
nucleotide is then examined, with the window shifted or redefined around the new nucleotide,
and so on, until the final nucleotide in the nucleic acid sequence for which state probabilities are
to be determined is reached.

In steps 22, 24, 26, and 28, probabilities are determined as in steps 12, 14, 16, and 18,
respectively, with the window in steps 22, 24, 26, and 28 corresponding to the nucleic acid
sequence in steps 12, 14, 16, and 18, respectively, for the purposes of those steps. At step 28, the
state probabilities for the nucleotide for which state probabilities are being determined are
associated with that nucleotide.

In step 30, the algorithm checks to see if the state probabilities for the last nucleotide
have just been determined. If yes, flow proceeds to step 32 and ends. If in step 30 the last
nucleotide has not been reached, flow proceeds to step 34, where the next nucleotide for which
state probabilities are to be determined is designated as the nucleotide to analyze in steps 22, 24,
26, and 28. After step 34, flow returns to steps 22, 24, 26, and 28, where the state probabilities
of the designated nucleotide are determined. At step 34, any nucleotide from the remaining
nucleotides that have not yet had state probabilities determined can be designated the next
nucleotide.

In a preferred embodiment, the first nucleotide to be examined in step 22 is the first
nucleotide in a contiguous nucleic acid sequence of nucleotides for which state probabilities are
to be determined, each subsequent nucleotide at step 34 is the next nucleotide of the contiguous
nucleic acid sequence of nucleotides for which state probabilities are to be determined, and the
last nucleotide in step 30 is the last nucleotide in the contiguous nucleic acid sequence of
nucleotides for which state probabilities are to be determined.

The window size can be the same or different for each nucleotide, and the nucleotide can
be located anywhere within its window. In a preferred embodiment, the window size is the same
for each nucleotide in the nucleic acid sequence, and each nucleotide is the middle nucleotide in
its own window. In one embodiment, windows are from 3 nucleotides to 1,000 nucleotides in
length, preferably 50 to 200 nucleotides in length, and more preferably from 75 to 125
nucleotides in length.
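
The per-nucleotide windowing loop of Figure 2 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the invention: the names `window_bounds`, `scan`, and `score_window` are hypothetical, the `score_window` callback stands in for the state probability calculation of steps 22 through 28, and windows are simply truncated at the ends of the sequence.

```python
from typing import Callable, Dict, List, Tuple


def window_bounds(i: int, n: int, w: int) -> Tuple[int, int]:
    """Return [start, end) of a window of size w centred on 0-based index i,
    truncated to the sequence bounds 0..n."""
    half = w // 2
    start = max(0, i - half)
    end = min(n, i - half + w)
    return start, end


def scan(seq: str, w: int,
         score_window: Callable[[str], Dict[str, float]]) -> List[Dict[str, float]]:
    """Assign to each nucleotide the state probabilities of its own window."""
    result = []
    for i in range(len(seq)):                         # steps 30 and 34: next nucleotide
        start, end = window_bounds(i, len(seq), w)
        result.append(score_window(seq[start:end]))   # steps 22 through 28
    return result
```

Any function that maps a window to a dictionary of state probabilities can be passed as `score_window`, which keeps the loop independent of the probability model chosen.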

The result of the process shown in Figure 2 is the association of state probabilities with 
each individual nucleotide for which state probabilities were determined. In one embodiment, 
the nucleotides for which state probabilities are to be determined are a contiguous nucleic acid
sequence of nucleotides within a longer nucleic acid sequence of nucleotides. 

Figures 3 through 7 all utilize probability models to determine state probabilities. Any 
probability model that allows for determination of the required probabilities in a plurality of 
states can be used, with use of an inhomogeneous Markov model preferred, and use of the 
inhomogeneous Markov model described above in reference to Figure 2 especially preferred. 

Figure 3 represents one embodiment of a method for determining the coding strand of a 
nucleic acid sequence. The process determines the state probabilities for each nucleotide in the 
nucleic acid sequence, sums the positive states for the nucleic acid sequence, and sums the 
negative states for the nucleic acid sequence. If the sums for the positive states and the negative 
states are sufficiently different, then the process determines that the state with the greater sum is 
the coding strand. 

In step 38, state probabilities are determined for each nucleotide in the nucleic acid 
sequence for which the coding strand is being determined. In one embodiment, state 
probabilities are determined using the inhomogeneous Markov model described above in 
reference to Figure 2. 

In step 40, the probabilities determined in step 38 for the positive states (1+,
2+, 3+, and N+) for each nucleotide in the nucleic acid sequence for which the coding strand is
being determined are summed. That is, the values for the states of noncoding, positive and
coding, positive in the first, second, and third reading frames for all nucleotides in the nucleic
acid sequence for which the coding strand is being determined are summed. The sum is set to
the arbitrary variable X.

In step 42, the values determined in step 38 for the negative states (1-, 2-, 3-, and N-) for each
nucleotide in the nucleic acid sequence for which the coding strand is being determined are
summed. That is, the values for the states of noncoding, negative and coding, negative in the
first, second, and third reading frames for all nucleotides in the nucleic acid sequence for which
the coding strand is being determined are summed. The sum is set to the variable Y.

Steps 40 and 42 can be performed in reverse order. 

In step 44, a function of X and Y is used to determine whether the state probabilities
indicate sufficient coding on one strand of the nucleic acid sequence. That is, it is determined
whether f(X,Y)<T, where T is a defined threshold value. Any function can be used that allows
for the desired discrimination. In one embodiment, the function used in step 44 is
f(X, Y) = [expression illegible in source], and the value of T is about 0.1 to about 0.9,
preferably is about 0.25 to about 0.75, and even more preferably is about 0.4 to about 0.6. If in
step 44 the function results in a value that is less than the threshold value, T, then flow proceeds
to step 46, where it is determined that coding is mixed or is not detectable. If in step 44 the
function results in a value that is equal to or greater than the threshold value, T, then flow
proceeds to step 48.

In step 48, it is determined on which strand coding occurs. A function of X is compared
to a function of Y to determine which strand is coding. Any two functions that allow for the
proper comparison can be used, including functions that weight one of the two strands. In one
embodiment, f(X) = X and f(Y) = Y, and the comparison in step 48 simply determines which
sum is greater. If in step 48 the function of X is found to be greater than the function of Y, then
flow proceeds to step 50, where it is determined that coding is on the positive strand. If in step 48
it is determined that the function of X is not greater than the function of Y, then flow proceeds to
step 52, where it is determined that coding is on the negative strand.

In another embodiment of the method represented by Figure 3, steps 44 and 46 can be
omitted [text illegible in source] on one strand. In this embodiment, flow begins at step 38 and,
after executing step 42, flow proceeds directly from step 42 to step 48.
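
The strand determination of Figure 3 can be sketched in Python as follows. The sketch is illustrative only. The discrimination function f(X, Y) is not legible in the source text, so the normalized absolute difference |X - Y| / (X + Y) used here is an assumption, as are the function name `coding_strand`, the dictionary layout of the state probabilities, and the default threshold of 0.5 (chosen from within the stated preferred range).

```python
from typing import Dict, List, Optional

POSITIVE = ("1+", "2+", "3+", "N+")
NEGATIVE = ("1-", "2-", "3-", "N-")


def coding_strand(probs: List[Dict[str, float]],
                  threshold: float = 0.5) -> Optional[str]:
    """Return "+" or "-", or None when coding is mixed or not detectable."""
    x = sum(p[s] for p in probs for s in POSITIVE)   # step 40: sum X
    y = sum(p[s] for p in probs for s in NEGATIVE)   # step 42: sum Y
    total = x + y
    # Assumed discrimination function (not taken from the source):
    # normalized absolute difference of the two sums.
    f = abs(x - y) / total if total > 0 else 0.0
    if f < threshold:                                # steps 44 and 46
        return None
    return "+" if x > y else "-"                     # steps 48, 50, and 52
```

Returning `None` corresponds to step 46. The alternative embodiment that bypasses steps 44 and 46 can be approximated by passing `threshold=0.0`.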

Figure 4 represents one embodiment of a method for determining the extent of an open
reading frame (ORF) within a nucleic acid sequence. The process determines the extent of the
open reading frame by first determining the state probabilities for each nucleotide in the nucleic
acid sequence. Then, beginning from within the nucleic acid sequence, preferably the
approximate middle of the nucleic acid sequence, and proceeding toward one end of the nucleic
acid sequence, the process examines each nucleotide in turn and determines whether the
nucleotide is sufficiently likely to code. When a sufficient number of nucleotides with an
insufficient likelihood of coding are encountered, the process determines that one end of the open
reading frame has been found. The process then repeats from the middle to the other end of the
nucleic acid sequence in order to find the second end of the open reading frame.

In step 56, the state probabilities of each of the nucleotides in the nucleic acid sequence
are determined. As stated above, any probability model that has the correct form of output can
be used, with an inhomogeneous Markov model preferred, and the inhomogeneous Markov
model described above and represented in Figure 2 most preferred.

In step 58, the coding strand of the nucleic acid sequence is determined and designated
"S." Any algorithm or method that can use the state probabilities produced in step 56 can be
used, and in a preferred embodiment, the method described above and represented in Figure 3 is
used. If the coding strand is indeterminate, an error can be returned at this step and processing
does not continue. In applications where the coding strand is already known or suspected, step 58
can be omitted from the process, in which case step 56 can flow directly to step 60.

In step 60, an arbitrary variable, L, is set to half of the length of the nucleic acid sequence
S, which designates L the middle nucleotide (determination of the middle for even and odd
sequences is done as described above for the middle nucleotide). In an alternative embodiment,
L can initially be set to any nucleotide in the nucleic acid sequence. It is preferred, however, to
begin with L relatively close to the middle of the putative ORF, because proper resolution of the
ends of the ORF is then more likely.

Steps 62, 64, and 66 effectively search through the nucleic acid sequence in a descending
direction from L toward the first nucleotide in the nucleic acid sequence for one of the ORF ends.
In step 62, the sum of the probabilities of the coding states on the strand S - that is, the set (1+,
2+, and 3+) or the set (1-, 2-, and 3-) depending on whether strand S is the positive or negative
strand - for nucleotide L is determined and compared to threshold value T'. In an alternative
embodiment, the probability of all six coding states (1+, 2+, 3+, 1-, 2-, and 3-) can be combined.
If the sum of the coding states is greater than or equal to a threshold value, T', and the nucleotide
is greater than the first nucleotide in the nucleic acid sequence (that is, L>1), then L is set to L-1
and P, an arbitrary counting variable, is set to L-1. In one embodiment, the value of T' is about
0.1 to about 0.9, preferably is about 0.25 to about 0.75, and even more preferably is about 0.4 to
about 0.6.

Flow then proceeds to step 64. If the sum of the coding states, as discussed above, is less 
than T' and P is greater than 1, then P is set to P-1. The effect of the two steps, 62 and 64, is to
reduce both L and P at the same rate if the sum of the coding states is greater than or equal to T',
or to reduce P but not L if the sum of the states is less than T'. 

After step 64, flow proceeds to step 66, where it is determined if L-P>T" or P=1. If
L-P>T", wherein T" is a threshold value, then a gap between the last nucleotide (L) with a sufficient
sum of coding states and the current nucleotide being examined has increased beyond the 
threshold value T". T" can be set to any number that allows for the proper gap of noncoding 
nucleotides. T" should be larger than the maximum expected length of an intron for the nucleic 
acid sequence. This number will depend in large part on the model sample set being used. If the 
number for T" is set too low, then a relatively lengthy intron will be sufficient to fix L at the end 
of an exon that is not at the end of the ORF. If P=1, then the end of the sequence has been
reached. In one embodiment, T" is about 10 to about 20,000 nucleotides, preferably about 50 to 
about 10,000 nucleotides, and more preferably about 500 to about 700 nucleotides. 

If neither condition in step 66 is met, then flow returns to step 62 and loops through steps 
64 and 66 until one of the conditions in step 66 is met, at which point flow proceeds to step 68. 
Steps 68, 70, 72, and 74 check for the end of the ORF in the ascending direction, and perform the 
same function as steps 60, 62, 64, and 66 but in the opposite direction. 

In step 68, M is set to the middle nucleotide. As above for L, this value can be altered in 
alternative embodiments. In step 70, the sum of the coding states, as above, is compared to T',
and M is compared to the length of the nucleic acid sequence. If the sum of the coding states of
nucleotide M is greater than or equal to T' and M is less than the length of the nucleic acid
sequence, then M is set to M+1 and Q is set to M+1. Flow proceeds to step 72, where, if the sum
of the coding states is less than T' and Q is less than the length of the nucleic acid sequence, then
Q is set to Q+1. Flow proceeds to step 74, where it is determined if Q-M>T", or Q> length of the
nucleic acid sequence. If either is true, then flow proceeds to step 76, where the ORF is 
determined to extend from nucleotide L to nucleotide M. If in step 74 neither condition is true, 
then flow loops to step 70. 
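
The two search loops of Figure 4 can be sketched as a single helper applied in both directions. This is a simplified reading of steps 60 through 76, not a literal transcription of the L, P, M, and Q bookkeeping: the names `scan_end` and `orf_extent` are hypothetical, indices are 0-based, `coding` is assumed to hold the per-nucleotide sum of coding-state probabilities on the coding strand, and the default values of `t_prime` (T') and `t_gap` (T") are picked from the stated preferred ranges.

```python
from typing import List, Tuple


def scan_end(coding: List[float], start: int, step: int,
             t_prime: float, t_gap: int) -> int:
    """Walk from `start` in direction `step` (+1 or -1) and return the last
    index whose coding sum is >= t_prime before a gap longer than t_gap or
    the end of the sequence is reached. If no position qualifies, `start`
    itself is returned."""
    last_good = start
    pos = start
    while 0 <= pos < len(coding):
        if coding[pos] >= t_prime:
            last_good = pos                       # advance the ORF end
        elif abs(pos - last_good) > t_gap:        # gap exceeds T"
            break
        pos += step
    return last_good


def orf_extent(coding: List[float], t_prime: float = 0.5,
               t_gap: int = 600) -> Tuple[int, int]:
    """Return 0-based inclusive (L, M) bounds of the putative ORF."""
    middle = len(coding) // 2                               # steps 60 and 68
    low = scan_end(coding, middle, -1, t_prime, t_gap)      # steps 62 through 66
    high = scan_end(coding, middle, +1, t_prime, t_gap)     # steps 70 through 74
    return low, high                                        # step 76
```

Using separate `t_prime` and `t_gap` arguments for each call to `scan_end` would give the alternative embodiment in which the two loops use different thresholds.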



In an alternative embodiment, different threshold values can be used in place of T' and T"
for the second loop, which comprises steps 70, 72, and 74. Different threshold values for steps
62, 64, and 66 versus steps 70, 72, and 74 could be desirable if, for example, one end of an ORF
was known or suspected to be degraded to some extent.

Figure 5 is a flowchart representing one embodiment of a method for determining the
location of deletions and additions within a nucleic acid sequence. The process first determines
the state probabilities for each nucleotide in the nucleic acid sequence. Then the process
determines whether in the window around a specific nucleotide the most likely state for the
nucleic acid sequence on one side of the specific nucleotide is different from the most likely state
for the nucleic acid sequence on the other side of the specific nucleotide. If so, the process
determines whether a hypothetical insertion or deletion at the specific nucleotide would
sufficiently improve the state probabilities of the entire nucleic acid sequence in the window. If
so, then an insertion or a deletion is indicated.

In step 78, the state probabilities of each of the nucleotides in the nucleic acid sequence are
determined. As stated above, any probability model that has the correct form of output can be
used, with an inhomogeneous Markov model preferred, and the inhomogeneous Markov model
described above and represented in Figure 2 most preferred.

In step 80, the first nucleotide is designated as "Z," and the size of a window W is set.
In step 82, the probabilities of each of the states of the nucleotides between Z and the midpoint of
the window, Z+W/2, are averaged, and the state with the greatest average is set to "A" (windows
with an even or odd number of nucleotides are treated as above for the middle nucleotide with
respect to determination of W/2). "A" is effectively the most likely state of the first half of
window W.

In step 84, the probabilities of the states of the nucleotides between the midpoint of the
window, Z+W/2, and the end of the window, Z+W, are averaged, and the state with the greatest
average is set to "B." "B" is effectively the most likely state of the second half of window W.

In step 86, the most probable states, A and B, are checked to see if they are each a coding
state and not the same coding state. If both A and B are coding states and they are not the same
coding state, then flow proceeds to steps 88, 90, and 92, where the nucleotide at Z+W/2 is
examined further. If, in step 86, A and B are the same coding state, or if one of the two is most
probably a noncoding state, then flow proceeds to 96, where it is determined if Z is greater than
the length of the nucleic acid sequence minus W/2. If so, then flow proceeds to step 98, and the
process ends. If, in step 96, Z is not within a distance of W/2 of the end of the nucleic acid
sequence, then flow proceeds to step 100, where Z is increased by one. Flow then loops to step
82.
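
The comparison of the two window halves in steps 82 through 86 can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the names `top_state` and `frame_shift_candidates` are hypothetical, indices are 0-based, each nucleotide's state probabilities are assumed to be held in a dictionary keyed by state name, and the scan stops when a full window no longer fits rather than applying the exact end test of step 96.

```python
from typing import Dict, List, Tuple

CODING = {"1+", "2+", "3+", "1-", "2-", "3-"}


def top_state(probs: List[Dict[str, float]]) -> str:
    """State with the greatest average probability over the given nucleotides."""
    states = probs[0].keys()
    return max(states, key=lambda s: sum(p[s] for p in probs) / len(probs))


def frame_shift_candidates(probs: List[Dict[str, float]],
                           w: int) -> List[Tuple[int, str, str]]:
    """Return (midpoint index, A, B) wherever the two halves of the window
    starting at Z are most likely in different coding states."""
    hits = []
    half = w // 2
    for z in range(0, len(probs) - w + 1):           # steps 96 and 100: advance Z
        a = top_state(probs[z:z + half])             # step 82: state A
        b = top_state(probs[z + half:z + w])         # step 84: state B
        if a in CODING and b in CODING and a != b:   # step 86
            hits.append((z + half, a, b))
    return hits
```

Each returned midpoint is a position at which the hypothetical insertion and deletion averages of steps 88 and 90 would then be evaluated.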

If in step 86 it was determined that both conditions were met, then flow proceeds to
steps 88 through 92 to determine if either a deletion or an addition occurred at nucleotide Z+W/2.

In step 88, a hypothetical average of state probabilities for state A for the entire window,
nucleotides Z to Z+W, for an insertion is determined. The hypothetical average of state
probabilities for state A is determined for the window as if the nucleotide at Z+W/2 is removed.
The probabilities of state A of the nucleotides in W are averaged to obtain the hypothetical
average state probabilities for state A for the entire window, and the value is set to N. In step 90,
a hypothetical average of state probabilities for state A for the entire window, nucleotides Z to
Z+W, for a deletion is calculated similarly. The hypothetical average of state probabilities for
state A in step 90 is determined and set to M for the window as if a nucleotide has been added on
one side or the other of the nucleotide at Z+W/2. By averaging the state probabilities of all of the
nucleotides in the window for either an insertion or a deletion, the values of N and M reflect the
likelihood that either an insertion or a deletion has taken place. In steps 88 and 90, in an
alternative embodiment, state B can be used in place of state A to achieve a similar result.

In step 92, the larger of M and N is compared to the sum of the probabilities of the states
indicating coding (1+, 2+, 3+, 1-, 2-, and 3-) of the nucleotide at Z+W/2. If in step 92 neither M
nor N is greater than the sum of the probabilities of the coding states of the nucleotide at Z+W/2,
then it is determined that no insertion or deletion has taken place and flow proceeds to step 96. If
in step 92 either M or N is greater than the sum of the probabilities of the coding states of the
nucleotide at Z+W/2, then it is determined that an insertion or a deletion has taken place, and flow
proceeds to step 94.

In step 94, a deletion is indicated if N is greater than M, and an insertion is indicated if N 
is not greater than M, and flow then proceeds to step 96. 
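
The decision made in steps 92 and 94 can be written compactly. The following Python sketch follows the text as written; the function name `call_indel` and its argument names are hypothetical, and the hypothetical averages N and M are assumed to have been computed already by whatever probability model is in use.

```python
from typing import Optional


def call_indel(n_avg: float, m_avg: float,
               coding_sum_mid: float) -> Optional[str]:
    """n_avg is N from step 88, m_avg is M from step 90, and coding_sum_mid is
    the summed coding-state probability of the nucleotide at Z+W/2."""
    if max(n_avg, m_avg) <= coding_sum_mid:     # step 92: neither average is greater
        return None
    # Step 94, as stated in the text: deletion if N > M, otherwise insertion.
    return "deletion" if n_avg > m_avg else "insertion"
```

A return value of `None` corresponds to the branch in which no insertion or deletion is indicated and flow proceeds to step 96.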

Figure 6 is a flow chart representing one embodiment of a method for determining the 
location of one or more exons within a nucleic acid sequence and the protein translation of those 
exons. The process begins by determining the state probabilities for each nucleotide in the 
nucleic acid sequence, the coding strand, and the extent of the open reading frame. The process 
then classifies each nucleotide according to its most probable state. Filters, which reclassify 
nucleotides in a defined manner in order to make local blocks of the nucleic acid sequence 
consistent, are then applied to the nucleic acid sequence. Regions of the nucleic acid sequence 
that are in any of classes 1, 2, or 3 are then designated as exons, and the exons are translated. 
Translation is accomplished by using the universal genetic code to convert the nucleic acid 
sequence of the designated exons into the corresponding amino acid sequence based on the 
reading frame of the class. That is, exons in class 1 will be translated in reading frame 1, exons 
in class 2 will be translated in reading frame 2, and exons in class 3 will be translated in
reading frame 3. The translation is linearly arranged to correspond to the linear arrangement of 
the exons along the nucleic acid sequence. 
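
Translation of a designated exon in the reading frame of its class can be sketched as follows. The codon table is the standard genetic code, built from the conventional T, C, A, G ordering. The function name `translate` is hypothetical, and the convention that reading frame k begins at 0-based offset k - 1 of the exon is an assumption made for illustration; the text does not fix that convention.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code: codon -> one-letter amino acid ("*" marks a stop).
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(exon: str, frame: int) -> str:
    """Translate an exon read in reading frame 1, 2, or 3.

    Assumption: frame k starts at 0-based offset k - 1 of the exon.
    Codons containing ambiguous bases are reported as "X"."""
    seq = exon.upper().replace("U", "T")
    start = frame - 1
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)
```

Concatenating the translations of successive exons in their order along the nucleic acid sequence gives the linearly arranged protein sequence described above.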

In step 102, the state probabilities of each of the nucleotides in the nucleic acid sequence 
are determined. As stated above, any probability model that has the correct form of output can be 
used, with an inhomogeneous Markov model preferred, and the inhomogeneous Markov model 
described above and represented in Figure 2 most preferred. In step 104, the strand and the 
extent of the open reading frame are determined. Any method for determining the strand and the
extent of the ORF that can use the state probabilities generated in step 102 can be used, and in a
preferred embodiment, the methods described above and represented in Figures 3 and 4 can be
used for such determination.

In step 106, the nucleotides in the nucleic acid sequence are categorized by the highest
probability state as determined in step 102. For example, in a model having four states for each
nucleic acid strand, each nucleotide is categorized as 1, 2, 3, or N.

In step 108, which is optional, one or more filters are applied to the nucleic acid sequence
in order to group adjacent nucleotides by class. Any filter that converts portions of the nucleic
acid sequence with inconsistent nucleotide classification to a more homogeneous state can be
used. The net effect of the application of one or more filters to the nucleic acid sequence
classification in step 106 will be to group adjacent nucleotides and blocks of nucleotides into the
same coding classification, thereby making exons and introns more uniform, and exon and intron
boundaries more evident.

In step 110, the filtered nucleic acid sequence is analyzed for exons. Any contiguous
regions with coding classes of 1, 2, or 3 are determined to be exons. Once each exon has been
identified, the exons can be translated using the universal genetic code, and a resulting protein
sequence derived.
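
The categorization of step 106 and the exon designation of step 110 can be sketched as follows. This is a minimal Python illustration: the names `classify` and `find_exons` are hypothetical, the state probabilities are assumed to be dictionaries keyed by state name, and a region whose class changes partway through is reported as separate runs here (a filter that unifies the classes within a coding region would merge them first).

```python
from itertools import groupby
from typing import Dict, List, Tuple


def classify(probs: List[Dict[str, float]], strand: str) -> List[str]:
    """Step 106: label each nucleotide "1", "2", "3", or "N" by its most
    probable state on the given strand ("+" or "-")."""
    labels = []
    for p in probs:
        best = max(("1", "2", "3", "N"), key=lambda c: p[c + strand])
        labels.append(best)
    return labels


def find_exons(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Step 110: return (start, end, class) for each contiguous run of one
    coding class; start is inclusive, end is exclusive, both 0-based."""
    exons = []
    pos = 0
    for cls, run in groupby(labels):
        length = len(list(run))
        if cls != "N":
            exons.append((pos, pos + length, cls))
        pos += length
    return exons
```

Each returned tuple identifies the slice of the sequence to translate and the class that selects its reading frame.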

Figure 7 is a second embodiment of the method described above and represented in
Figure 6, with explicit filtering steps detailed therein. In Figure 7, steps 102, 104, 106, and 110
are the same as described above in reference to Figure 6. Steps
112, 114, 116, 118, 120, 122, and 124 are filter steps that are applied to the categorized nucleic
acid sequence produced in step 106. The order shown for the filter steps, 112, 114, 116, 118,
120, 122, and 124, can be rearranged to occur in any order in the process, and any combination
of the steps can be used, including combinations that omit one or more of the filtering steps.

In step 112, any noncoding nucleotide flanked by two nucleotides with the same class is
reclassified into the class of the two flanking nucleotides. For example, 1,N,1 would be
converted to 1,1,1.

In step 114, any nucleotide that is flanked by two pairs of adjacent nucleotides all with
the same class is reclassified into the class of the flanking nucleotides. For example, 1,1,2,1,1
would be converted to 1,1,1,1,1.

In step 116, any adjacent nucleotide pair having the same class that is flanked by two
pairs of adjacent nucleotides all with the same class is reclassified into the class of the flanking
nucleotides. For example, 1,1,2,2,1,1 would be converted to 1,1,1,1,1,1.

In step 118, any adjacent nucleotide pair having the same class that is flanked by two
nucleotides with the same class is reclassified into the class of the flanking nucleotides. For
example, 1,2,2,1 would be converted to 1,1,1,1.

In step 120, any nucleotide flanked by two nucleotides with the same class is reclassified 
into the class of the flanking nucleotides. For example, 1,2,1 is converted to 1,1,1. 
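
The short-run filters of steps 112 through 120 share one pattern: a run of one or two identically classified nucleotides is absorbed when it is flanked on both sides by one or two nucleotides of a common class. A single hypothetical helper, `smooth`, can express all five. This is a sketch under assumptions, not the filters of the invention: class labels are single-character strings, and edits are applied in place during one left-to-right pass, so an earlier reclassification can influence a later one.

```python
from typing import List


def smooth(labels: List[str], run_len: int, flank_len: int,
           only_noncoding: bool = False) -> List[str]:
    """Reclassify any run of `run_len` identical labels that is flanked on
    both sides by `flank_len` labels of one common, different class."""
    out = list(labels)
    n = len(out)
    for i in range(flank_len, n - run_len - flank_len + 1):
        run = out[i:i + run_len]
        left = out[i - flank_len:i]
        right = out[i + run_len:i + run_len + flank_len]
        flank = left[0]
        if (len(set(run)) == 1 and run[0] != flank
                and set(left) == {flank} and set(right) == {flank}
                and (not only_noncoding or run[0] == "N")):
            out[i:i + run_len] = [flank] * run_len
    return out
```

Under these assumptions, step 112 corresponds to `smooth(x, 1, 1, only_noncoding=True)`, step 114 to `smooth(x, 1, 2)`, step 116 to `smooth(x, 2, 2)`, step 118 to `smooth(x, 2, 1)`, and step 120 to `smooth(x, 1, 1)`.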

In step 122, any contiguous, noncoding nucleotide region with an insufficient length is 
reclassified into the class of the flanking coding regions. An insufficient length is any length that 
is too small to be an intron. This length will be dependent in large part upon the particular
nucleic acid sequence under study. In one embodiment, a length of about 10 to 50, preferably
about 20 to 40, and more preferably about 25 to 35 nucleotides in length is used. The size of the
noncoding nucleotide length required can, in alternative embodiments, be changed as appropriate
to better suit examination of the nucleic acid sequence under study. In step 122, the
classification of the flanking regions of coding nucleotides can be extended into the noncoding
regions an equal amount on either side, an unequal amount on either side, or entirely on one side 
or the other. 

In step 124, any coding region (i.e. a region with nucleotides of classes 1, 2, or 3 
comprising more than one nucleotide classification) is reclassified as the most common class in 
that coding segment. 
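
The region-level filters of steps 122 and 124 can be sketched as follows. The function names are hypothetical, and the sketch implements only one of the variants the text allows for step 122: the short noncoding run is filled entirely from the coding region on its left. The default `min_intron` of 30 is taken from the stated preferred range.

```python
from collections import Counter
from itertools import groupby
from typing import List


def fill_short_noncoding(labels: List[str], min_intron: int = 30) -> List[str]:
    """Step 122 (one variant): a noncoding run shorter than min_intron that
    lies between two coding regions takes the class of the region on its left."""
    out = list(labels)
    pos = 0
    for cls, run in groupby(labels):
        length = len(list(run))
        end = pos + length
        if cls == "N" and length < min_intron and pos > 0 and end < len(labels):
            out[pos:end] = [out[pos - 1]] * length
        pos = end
    return out


def unify_coding_regions(labels: List[str]) -> List[str]:
    """Step 124: every maximal run of coding labels takes its most common class."""
    out = list(labels)
    pos = 0
    for is_coding, run in groupby(labels, key=lambda c: c != "N"):
        block = list(run)
        if is_coding:
            winner = Counter(block).most_common(1)[0][0]
            out[pos:pos + len(block)] = [winner] * len(block)
        pos += len(block)
    return out
```

Applying `fill_short_noncoding` before `unify_coding_regions` lets a region that was split by a short noncoding gap be resolved to a single class.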

Flow proceeds to step 110, where the filtered nucleic acid sequence is analyzed for exons.
Any contiguous regions with nucleotides of classes 1, 2, or 3 are determined to be exons. Once
each exon has been identified, the exons can be translated using the universal genetic code, and a 
resulting protein sequence derived. 

While performing the methods described above in Figures 1-7, windows can sometimes
extend past the end of a sequence. Conventional applications that use window-based probability
models for multiple nucleotides, such as the windows described above, are limited in their
application at the ends of nucleic acid sequences. Since coding probability can be calculated
using a window that is centered on each nucleotide of a nucleic acid sequence in turn, a window
can extend beyond an end of a sequence. Figure 8a schematically represents a nucleic acid
sequence 200 with a window 204 of length "W." As shown in Figure 8a, the window 204 is
empty for the first W/2 bases at an end 206 of the sequence 200.

As shown in Figure 8b, the present invention remedies this problem by using the local 
nucleic acid sequence 216 at the end 206 of the nucleic acid sequence 200 as a source for 
hypothetical nucleotides added on to the end 206 of the nucleic acid sequence 200. As shown in
Figure 8c, a copy 218 of the local nucleic acid sequence 216 can be created. As shown in Figure 
8d, the copy 218 can then be appended onto the end 206 to form a hypothetical nucleic acid 
sequence extension. As shown in Figure 8d, the window 204 is now filled with nucleotides from 
the nucleic acid sequence 200 and the hypothetical nucleic acid sequence extension 218, which 
allows for probability determination within the window 204. As shown in Figures 8b, 8c, and 
8d, the same process can be performed on the other end of the sequence at the same time. Any
number of nucleotides can be copied and added in this manner in order to provide the correct size
window. In a preferred embodiment, the number of nucleotides copied is a multiple of three. 
For example, if a 100 nucleotide window is desired for the first nucleotide in the nucleic acid 
sequence, the first 51 nucleotides of the nucleic acid sequence can be copied to form a 
hypothetical 51 nucleotide extension. When state probabilities are determined for the first 
nucleotide, the 51 appended nucleotides are used to fill the first half of the window. The same or 
different nucleotides can be copied and used in a similar manner for any other nucleotides
without a sufficient window. This process can be repeated for the other end of the nucleic acid 
sequence, of course, as needed. The copied nucleotides can be appended in either orientation on 
the end of the nucleic acid sequence. 
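
The end extension of Figures 8b through 8d can be sketched as follows. This is a minimal Python illustration: the function name `pad_ends` is hypothetical, the copied nucleotides are appended in their original orientation (the text permits either), and the extension length is rounded up to a multiple of three, which for a 100 nucleotide window gives the 51 nucleotide extension of the example above.

```python
def pad_ends(seq: str, w: int) -> str:
    """Extend both ends of seq with copies of the local sequence so that a
    window of size w centred on any original nucleotide is completely filled."""
    need = w // 2 + 1
    need += (-need) % 3              # round up to a multiple of three
    if need > len(seq):
        raise ValueError("sequence shorter than the required extension")
    # Prepend a copy of the first `need` bases and append a copy of the last.
    return seq[:need] + seq + seq[-need:]
```

After padding, original position i is found at index i + need in the returned string, so state probabilities computed on the padded sequence must be mapped back by that offset.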

Implementation: 

A computer system capable of carrying out the functionality and methods described
above is shown in more detail in Figure 9a. A computer system 702 includes one or more
processors, such as a processor 704. The processor 704 is connected to a communication bus
706. The computer system 702 also includes a main memory 708, which is preferably random
access memory (RAM). Various software embodiments are described in terms of this exemplary
computer system. After reading this description, it will become apparent to a person skilled in
the relevant art how to implement the invention using other computer systems and/or computer
architectures.

In a further embodiment, shown in Figure 9b, the computer system can also include a
secondary memory 710. The secondary memory 710 can include, for example, a hard disk drive
712 and/or a removable storage drive 714, representing a floppy disk drive, a magnetic tape
drive, or an optical disk drive, among others. The removable storage drive 714 reads from and/or
writes to a removable storage unit 718 in a well known manner. The removable storage unit 718
represents, for example, a floppy disk, magnetic tape, or an optical disk, which is read by and
written to by the removable storage drive 714. As will be appreciated, the removable storage
unit 718 includes a computer usable storage medium having stored therein computer software
and/or data.

In alternative embodiments, the secondary memory 710 may include other similar means
for allowing computer programs or other instructions to be loaded into the computer system.
Such means can include, for example, a removable storage unit 722 and an interface 720.
Examples of such can include a program cartridge and cartridge interface (such as that found in
video game devices), a removable memory chip (such as an EPROM, or PROM) and associated
socket, and other removable storage units 722 and interfaces 720 which allow software and data
to be transferred from the removable storage unit 722 to the computer system.

The computer system can also include a communications interface 724. The
communications interface 724 allows software and data to be transferred between the computer
system and external devices. Examples of the communications interface 724 can include a
modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot
and card, etc. Software and data transferred via the communications interface 724 are in the form
of signals 726 that can be electronic, electromagnetic, optical or other signals capable of being
received by the communications interface 724. Signals 726 are provided to communications
interface via a channel 728. A channel 728 carries signals 726 in two directions and can be
implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link and
other communications channels. In one embodiment, the channel is a connection to a network.
The network can be any network known in the art, including, but not limited to, LANs, WANs,
and the Internet. Nucleic acid sequence data can be stored in remote systems, databases, or
distributed databases, among others, for example GenBank, and transferred to computer system
for processing via the network. In a preferred embodiment, nucleic acid sequence data is
received through the Internet via the channel 728. Nucleic acid sequences can be input into the
system and stored in the main memory 708. Input devices include the communication and
storage devices described herein, as well as keyboards, voice input, and other devices for
transferring data to a computer system. In a further embodiment, nucleic acid sequences can be
generated by an automatic sequencer, for example any that are known in the art, and the
implementations described herein can be incorporated within the automatic sequencer device in
order to directly use the output of the automatic sequencer.

In this document, the terms "computer program medium" and "computer usable medium"
are used to generally refer to media such as the removable storage device 718, a hard disk
installed in hard disk drive 712, and signals 726. These computer program products are means
for providing software to the computer system.

Computer programs (also called computer control logic) are stored in the main memory 
708 and/or the secondary memory 710. Computer programs can also be received via the 
communications interface 724. Such computer programs, when executed, enable the computer 
system to perform the features of the present invention as discussed herein. In particular, the
computer programs, when executed, enable the processor 704 to perform the features of the
present invention. Accordingly, such computer programs represent controllers of the computer
system.

In an embodiment where the invention is implemented using software, the software may 
be stored in a computer program product and loaded into the computer system using the
removable storage drive 714, the hard drive 712 or the communications interface 724. The
control logic (software), when executed by the processor 704, causes the processor 704 to
perform the functions of the invention as described herein.

In another embodiment, the invention is implemented primarily in hardware using, for
example, hardware components such as application specific integrated circuits (ASICs). In one
embodiment incorporating ASIC technology, a self-contained device, which could be hand-held,
has integrated circuits specific to perform the methods described above without the need for
software. Implementation of such a hardware state machine so as to perform the functions
described herein will be apparent to persons skilled in the relevant art(s). In yet another
embodiment, the invention is implemented using a combination of both hardware and software.

The following examples are illustrative only. It is not intended that the present invention
be limited to the illustrative embodiments.

EXAMPLE 1 

Referring now to Figures 10a, 10b, and 10c, examples of biasing are shown. Figure 10a
shows a portion of genomic DNA 300. Aligned with the genomic DNA 300 is an expressed
sequence tag (EST) 302. The EST 302 comprises coding regions 304 and noncoding regions
306. In Figure 10b a window 308 of nucleotides is examined. The window 308 is positioned on
the genomic DNA 300 that corresponds to a known coding region 304 on the EST 302. The a
priori probability of coding is said to be 100% over that window 308 and a bias is applied
accordingly. In Figure 10c, a different window 310 straddles the intron-exon boundary and the
a priori probability of coding is said to be 100% for the nucleotides in the window 310 that
correspond to the coding region 304 of the EST 302, while the a priori probability of coding is
said to be 0% for the nucleotides in the window 310 that correspond to the noncoding region 306
of the EST 302.

Bias is applied to the two different situations shown in Figures 10b and 10c as follows. 
The general equation for the probability of the sequence $S = a_1 \ldots a_m$ of a Markov process of order 
$n$ is shown in Equation VII: 

\[ P(a_1 \ldots a_m) = P(a_1 \ldots a_n) \cdot P(a_{n+1} \mid a_1 \ldots a_n) \cdots P(a_m \mid a_{m-n} \ldots a_{m-1}) \tag{VII} \]
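
The chain-rule product in Equation VII can be illustrated with a short Python sketch. This is only a sketch under stated assumptions: the function name, the dictionary layout, and the toy two-letter model are hypothetical and are not taken from the specification.

```python
def markov_sequence_probability(seq, order, initial, transition):
    """Probability of `seq` under a homogeneous Markov chain of the given order.

    initial:    dict mapping the first `order` symbols (as a string) to a probability
    transition: dict mapping (context, symbol) to P(symbol | context)
    Both tables are hypothetical inputs chosen for illustration.
    """
    if len(seq) < order:
        raise ValueError("sequence is shorter than the model order")
    # First factor of Equation VII: probability of the initial n-mer.
    p = initial[seq[:order]] if order > 0 else 1.0
    # Remaining factors: each symbol conditioned on the n symbols before it.
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        p *= transition[(context, seq[i])]
    return p


# Toy first-order model over a two-letter alphabet (values are made up).
initial = {"A": 0.6, "B": 0.4}
transition = {("A", "A"): 0.7, ("A", "B"): 0.3,
              ("B", "A"): 0.2, ("B", "B"): 0.8}
p_abb = markov_sequence_probability("ABB", 1, initial, transition)
```

Here `p_abb` is the product P(A) · P(B|A) · P(B|B) for the toy model.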
This equation is based on an inhomogeneous Markov model, whereby the initial and 
transitional probabilities are dependent on the periodic state of the sequence (as in a hidden 
Markov model with fixed state transition probabilities). In this model, initial and transition 
probabilities are dependent on the sequence orientation and phase in which the sequence is read 
relative to the codons in the coding portion of the nucleic acid sequence. Thus, for a sequence 
$S = a_1 \ldots a_m$, equation VIII is used: 

\[ P_\sigma(S) = P_\sigma(a_1 \ldots a_n) \cdot \prod_{i=1}^{m-n} P_{\pi(\sigma,i)}(a_{n+i} \mid a_i \ldots a_{n+i-1}) \tag{VIII} \]

where, given a state $\sigma \in \{1+, 2+, 3+, N+, 1-, 2-, 3-, N-\}$ representing the possible states 
for reading the sequence, 

\[ \pi(\sigma,i) = \begin{cases} i \bmod 3 + 1 & \text{if } \sigma = 1\pm \\ (i+1) \bmod 3 + 1 & \text{if } \sigma = 2\pm \\ (i+2) \bmod 3 + 1 & \text{if } \sigma = 3\pm \end{cases} \tag{IX} \]

Equation X is used to apply Bayes' rule to determine the probability that the sequence $S$ is 
in state $\sigma$: 

\[ P(\sigma \mid S) = \frac{P_\sigma \cdot P_\sigma(S)}{\sum_{i \in \{1+,2+,3+,N+,1-,2-,3-,N-\}} P_i \cdot P_i(S)} \tag{X} \]

A bias function is added to equation X in order to allow for biasing of regions of DNA for 
which coding information is available. The bias function is incorporated in equation XI: 

\[ P(\sigma \mid S) = \frac{\phi(\sigma) \cdot P_\sigma \cdot P_\sigma(S)}{\sum_{i \in \{1+,2+,3+,N+,1-,2-,3-,N-\}} \phi(i) \cdot P_i \cdot P_i(S)} \tag{XI} \]
Equation XI can be applied to the hypothetical region of DNA shown in the window 308 
in Figure 10b. Since the entirety of the sequence in the window 308 lies in a coding region (as 
determined with the EST 302), a bias function $\phi(\sigma)$ can be defined according to equation XII: 

\[ \phi(\sigma) = \begin{cases} 0 & \text{if } \sigma \in \{1-, 2-, 3-, N+, N-\} \\ 1 & \text{if } \sigma \in \{1+, 2+, 3+\} \end{cases} \tag{XII} \]

which reflects that we know with 100% certainty that the sequence segment must be 
coding in one of the three direct reading frames, but that we do not know which. In this case, 
since $\phi(\sigma) = 0$ where $\sigma \in \{N+, 1-, 2-, 3-, N-\}$, equation XI can be written as equation XIII: 


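\[ P(\sigma \mid S) = \begin{cases} 0 & \text{if } \sigma \in \{1-, 2-, 3-, N+, N-\} \\ \dfrac{P_\sigma \cdot P_\sigma(S)}{\sum_{i \in \{1+,2+,3+\}} P_i \cdot P_i(S)} & \text{if } \sigma \in \{1+, 2+, 3+\} \end{cases} \tag{XIII} \]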


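Because $P_{1+} = P_{2+} = P_{3+}$ (since the EST does not indicate any difference in probability 
among the three reading frames), equation XIII can be simplified as shown in equation XIV: 

\[ P(\sigma \mid S) = \begin{cases} 0 & \text{if } \sigma \in \{1-, 2-, 3-, N+, N-\} \\ \dfrac{P_\sigma(S)}{\sum_{i \in \{1+,2+,3+\}} P_i(S)} & \text{if } \sigma \in \{1+, 2+, 3+\} \end{cases} \tag{XIV} \]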


The function $\phi(\sigma)$ results in a coding potential (equation XIV) substantially different than 
the unbiased coding potential function (shown by equation X). In this example, the chosen bias 
function reduces the probability of the evaluated window 308 to zero in all but the three 
plus-strand coding states. This effectively forces the window to be evaluated as coding in one of the 
positive coding states, while not biasing the probability of those states relative to each other (e.g., 
the ratio $P(1+ \mid S) / P(2+ \mid S)$ is the same with or without the bias function, whereas 
$P(1+ \mid S)$ itself may differ). 
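
The effect of a zero/one bias function on the Bayes computation can be checked numerically. The following Python sketch assumes made-up window likelihoods and equal priors; the function name `posterior` and all numeric values are hypothetical and serve only to illustrate that the forward-frame ratios are unchanged while the excluded states drop to zero.

```python
STATES = ["1+", "2+", "3+", "N+", "1-", "2-", "3-", "N-"]


def posterior(likelihood, prior, bias=None):
    """Posterior P(state | S) with an optional bias function.

    likelihood: dict state -> P_state(S)
    prior:      dict state -> P_state
    bias:       dict state -> phi(state); None means phi = 1 for every state
    """
    if bias is None:
        bias = {s: 1.0 for s in likelihood}
    weighted = {s: bias[s] * prior[s] * likelihood[s] for s in likelihood}
    total = sum(weighted.values())
    if total == 0.0:
        raise ValueError("no state is left with non-zero weight")
    return {s: w / total for s, w in weighted.items()}


# Made-up likelihoods for one window; equal priors for simplicity.
likelihood = {"1+": 4e-6, "2+": 1e-6, "3+": 2e-6, "N+": 3e-6,
              "1-": 5e-6, "2-": 1e-6, "3-": 1e-6, "N-": 3e-6}
prior = {s: 1.0 / len(STATES) for s in STATES}

# Bias for a window known to be coding on the direct strand.
hard_bias = {s: (1.0 if s in ("1+", "2+", "3+") else 0.0) for s in STATES}

unbiased = posterior(likelihood, prior)
biased = posterior(likelihood, prior, hard_bias)
```

With these toy inputs the most probable unbiased state is on the reverse strand, while the biased result distributes all probability among the three forward frames in the same proportions as before.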



Figure 10c illustrates a window 310 wherein the evaluated sequence straddles an exon-intron 
boundary as indicated by the EST 302. A possible function $\phi(\sigma)$ for this situation would 
be to expand equation XII to equation XIII: 

\[ \phi(\sigma) = \begin{cases} \epsilon & \text{if } \sigma \in \{1+, 2+, 3+\} \\ 1 - \epsilon & \text{if } \sigma \in \{N+, N-\} \\ 0 & \text{if } \sigma \in \{1-, 2-, 3-\} \end{cases} \tag{XIII} \]

where $\epsilon$ represents the fraction of bases in the part of the sequence in the window that lies 
in the coding region of the DNA 300 as indicated by the coding region 304 of the EST 302. If 
equation XIII is put into equation XI, equation XIV results: 



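\[ P(\sigma \mid S) = \begin{cases} 0 & \text{if } \sigma \in \{1-, 2-, 3-\} \\ \dfrac{\epsilon \cdot P_\sigma \cdot P_\sigma(S)}{\sum_{i \in \{1+,2+,3+,N+,N-\}} \phi(i) \cdot P_i \cdot P_i(S)} & \text{if } \sigma \in \{1+, 2+, 3+\} \\ \dfrac{(1-\epsilon) \cdot P_\sigma \cdot P_\sigma(S)}{\sum_{i \in \{1+,2+,3+,N+,N-\}} \phi(i) \cdot P_i \cdot P_i(S)} & \text{if } \sigma \in \{N+, N-\} \end{cases} \tag{XIV} \]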

where $P_\sigma = \frac{1}{4}$ for $\sigma \in \{N+, N-\}$ and $P_\sigma = \frac{1}{6}$ for $\sigma \in \{1+, 2+, 3+\}$ (given the assumption that 
coding and noncoding are equiprobable events, that each coding state is equiprobable with any other 
coding state, and that both noncoding states are equiprobable, $\frac{1}{4} \times 2 = \frac{1}{2}$ and $\frac{1}{6} \times 3 = \frac{1}{2}$). 
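
The bias function for a window that straddles an exon-intron boundary can be sketched in Python as follows. The boolean mask representation of the EST alignment and the function name are assumptions made for illustration; only the three-way assignment of the coding fraction follows the text above.

```python
def straddling_bias(coding_mask):
    """Bias function for a window that straddles an exon-intron boundary.

    coding_mask: one boolean per nucleotide in the window, True where the EST
                 alignment marks the nucleotide as coding (hypothetical input format).
    Returns a dict state -> phi(state).
    """
    if not coding_mask:
        raise ValueError("window is empty")
    eps = sum(coding_mask) / len(coding_mask)  # fraction of coding bases in the window
    bias = {}
    for state in ("1+", "2+", "3+"):
        bias[state] = eps            # direct-strand coding states
    for state in ("N+", "N-"):
        bias[state] = 1.0 - eps      # noncoding states
    for state in ("1-", "2-", "3-"):
        bias[state] = 0.0            # reverse-strand coding states are excluded
    return bias


# A 12-nucleotide window whose first 9 bases align to the coding part of the EST.
window = [True] * 9 + [False] * 3
phi = straddling_bias(window)
```

A window that lies entirely in the coding region gives a coding fraction of 1, which reduces this function to the zero/one bias of equation XII.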

EXAMPLE 2 

The following example illustrates the computations involved in probability calculations 
for a sequence with and without a bias applied. The nucleotide sequence GATGACATT is used 
in this example for clarity and simplicity, but it is understood that longer sequences as indicated 
above can be used. Further, for this example, a zero order inhomogeneous Markov model is 
used. In this model, the initial probabilities are all 1 and each event is independent of that which 
precedes it (the conditional probability of $a_{k+1}$ given $a_1 \ldots a_k$ becomes the probability of 
$a_{k+1}$ alone because $k$ is zero). Models of higher order can be used, as described above. 

Accordingly, the following hypothetical table of probabilities is used: 





            Direct (+)              Reverse (-) 
        1+      2+      3+      1-      2-      3-      N 
T       0.13    0.27    0.13    0.10    0.25    0.21    0.20 
C       0.28    0.26    0.39    0.39    0.21    0.38    0.30 
A       0.21    0.26    0.09    0.13    0.27    0.13    0.21 
G       0.38    0.21    0.39    0.38    0.26    0.28    0.29 



Without a bias function $\phi(\sigma)$ to incorporate known information in the calculations, $P(S \mid \sigma)$ 
can be calculated for the zero order case for the sequence GATGACATT according to equations 
XV through XXI. 
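
\[ \begin{aligned} P(\text{GATGACATT} \mid 1+) &= P_{1+}(G) \cdot P_{2+}(A) \cdot P_{3+}(T) \cdot P_{1+}(G) \cdot P_{2+}(A) \cdot P_{3+}(C) \cdot P_{1+}(A) \cdot P_{2+}(T) \cdot P_{3+}(T) \\ &= 0.38 \times 0.26 \times 0.13 \times 0.38 \times 0.26 \times 0.39 \times 0.21 \times 0.27 \times 0.13 \\ &= 3.6479448 \times 10^{-6} \end{aligned} \tag{XV} \]

\[ \begin{aligned} P(\text{GATGACATT} \mid 2+) &= P_{2+}(G) \cdot P_{3+}(A) \cdot P_{1+}(T) \cdot P_{2+}(G) \cdot P_{3+}(A) \cdot P_{1+}(C) \cdot P_{2+}(A) \cdot P_{3+}(T) \cdot P_{1+}(T) \\ &= 0.21 \times 0.09 \times 0.13 \times 0.21 \times 0.09 \times 0.28 \times 0.26 \times 0.13 \times 0.13 \\ &= 5.71332739 \times 10^{-8} \end{aligned} \tag{XVI} \]

\[ \begin{aligned} P(\text{GATGACATT} \mid 3+) &= P_{3+}(G) \cdot P_{1+}(A) \cdot P_{2+}(T) \cdot P_{3+}(G) \cdot P_{1+}(A) \cdot P_{2+}(C) \cdot P_{3+}(A) \cdot P_{1+}(T) \cdot P_{2+}(T) \\ &= 0.39 \times 0.21 \times 0.27 \times 0.39 \times 0.21 \times 0.26 \times 0.09 \times 0.13 \times 0.27 \\ &= 1.4874917 \times 10^{-6} \end{aligned} \tag{XVII} \]

\[ \begin{aligned} P(\text{GATGACATT} \mid 1-) &= P_{1-}(G) \cdot P_{2-}(A) \cdot P_{3-}(T) \cdot P_{1-}(G) \cdot P_{2-}(A) \cdot P_{3-}(C) \cdot P_{1-}(A) \cdot P_{2-}(T) \cdot P_{3-}(T) \\ &= 0.38 \times 0.27 \times 0.21 \times 0.38 \times 0.27 \times 0.38 \times 0.13 \times 0.25 \times 0.21 \\ &= 5.7332419 \times 10^{-6} \end{aligned} \tag{XVIII} \]

\[ \begin{aligned} P(\text{GATGACATT} \mid 2-) &= P_{2-}(G) \cdot P_{3-}(A) \cdot P_{1-}(T) \cdot P_{2-}(G) \cdot P_{3-}(A) \cdot P_{1-}(C) \cdot P_{2-}(A) \cdot P_{3-}(T) \cdot P_{1-}(T) \\ &= 0.26 \times 0.13 \times 0.10 \times 0.26 \times 0.13 \times 0.39 \times 0.27 \times 0.21 \times 0.10 \\ &= 2.5262776 \times 10^{-7} \end{aligned} \tag{XIX} \]

\[ \begin{aligned} P(\text{GATGACATT} \mid 3-) &= P_{3-}(G) \cdot P_{1-}(A) \cdot P_{2-}(T) \cdot P_{3-}(G) \cdot P_{1-}(A) \cdot P_{2-}(C) \cdot P_{3-}(A) \cdot P_{1-}(T) \cdot P_{2-}(T) \\ &= 0.28 \times 0.13 \times 0.25 \times 0.28 \times 0.13 \times 0.21 \times 0.13 \times 0.10 \times 0.25 \\ &= 2.2607130 \times 10^{-7} \end{aligned} \tag{XX} \]

\[ \begin{aligned} P(\text{GATGACATT} \mid N) &= P_N(G) \cdot P_N(A) \cdot P_N(T) \cdot P_N(G) \cdot P_N(A) \cdot P_N(C) \cdot P_N(A) \cdot P_N(T) \cdot P_N(T) \\ &= 0.29 \times 0.21 \times 0.20 \times 0.29 \times 0.21 \times 0.30 \times 0.21 \times 0.20 \times 0.20 \\ &= 1.8692402 \times 10^{-6} \end{aligned} \tag{XXI} \]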
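
The zero order products above can be reproduced with a few lines of Python. This is a sketch, not the program used to generate the example: the dictionary layout, the function name, and the rule that the starting phase advances cyclically by one per nucleotide are assumptions read off the factor lists of the equations.

```python
# Emission probabilities from the hypothetical table in Example 2.
# Columns are keyed by codon position and strand; "N" is the noncoding column.
TABLE = {
    "1+": {"T": 0.13, "C": 0.28, "A": 0.21, "G": 0.38},
    "2+": {"T": 0.27, "C": 0.26, "A": 0.26, "G": 0.21},
    "3+": {"T": 0.13, "C": 0.39, "A": 0.09, "G": 0.39},
    "1-": {"T": 0.10, "C": 0.39, "A": 0.13, "G": 0.38},
    "2-": {"T": 0.25, "C": 0.21, "A": 0.27, "G": 0.26},
    "3-": {"T": 0.21, "C": 0.38, "A": 0.13, "G": 0.28},
    "N":  {"T": 0.20, "C": 0.30, "A": 0.21, "G": 0.29},
}


def zero_order_likelihood(seq, state):
    """P(seq | state) for a zero order inhomogeneous model.

    state is "N" or a starting codon position with a strand sign, e.g. "2+".
    The codon position cycles 1 -> 2 -> 3 -> 1 along the sequence.
    """
    if state == "N":
        columns = ["N"] * len(seq)
    else:
        start, strand = int(state[0]), state[1]
        columns = [f"{(start - 1 + k) % 3 + 1}{strand}" for k in range(len(seq))]
    p = 1.0
    for base, column in zip(seq, columns):
        p *= TABLE[column][base]
    return p


likelihoods = {s: zero_order_likelihood("GATGACATT", s)
               for s in ("1+", "2+", "3+", "1-", "2-", "3-", "N")}
```

Evaluating `likelihoods` should give values close to those listed for the seven states, with the first reading frame of the reverse strand having the largest likelihood.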



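Given the values of $P(S \mid \sigma)$, we can determine the probability that the given sequence 
segment is in state $\sigma$, $P(\sigma \mid S)$, using equation XXII (Bayes' Rule): 

\[ P(\sigma \mid S) = \frac{P(\sigma) \cdot P(S \mid \sigma)}{\sum_i \left[ P(i) \cdot P(S \mid i) \right]} \tag{XXII} \]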



Equations XXIII through XXIX show the calculations for each of the states. 

\[ P(1+ \mid S) = \frac{\frac{1}{12}(3.6479448 \times 10^{-6})}{\frac{1}{12}(3.6479448 \times 10^{-6}) + \cdots + \frac{1}{12}(1.8692402 \times 10^{-6})} = \frac{\frac{1}{12}(3.6479448 \times 10^{-6})}{1.1060761 \times 10^{-6}} = 0.27484131 \tag{XXIII} \]

\[ P(2+ \mid S) = \frac{4.7611081 \times 10^{-9}}{1.1060761 \times 10^{-6}} = 0.004304501 \tag{XXIV} \]

\[ P(3+ \mid S) = \frac{P(3+) \cdot P(S \mid 3+)}{1.1060761 \times 10^{-6}} = 0.11156173 \tag{XXV} \]

\[ P(1- \mid S) = \frac{4.7777018 \times 10^{-7}}{1.1060761 \times 10^{-6}} = 0.43195053 \tag{XXVI} \]

\[ P(2- \mid S) = \frac{2.1052313 \times 10^{-8}}{1.1060761 \times 10^{-6}} = 0.019033331 \tag{XXVII} \]

\[ P(3- \mid S) = \frac{1.8839275 \times 10^{-8}}{1.1060761 \times 10^{-6}} = 0.017032531 \tag{XXVIII} \]

\[ P(N \mid S) = \frac{P(N) \cdot P(S \mid N)}{1.1060761 \times 10^{-6}} = 0.14076807 \tag{XXIX} \]



The coding probability function indicates a 43% probability that the sequence is coding in 
the first reading frame of the reverse-complement strand (-) of the sequence provided, based on 
the zero order inhomogeneous Markov model used. While this is the most probable state, it is also true 
that there is a greater probability (57%) that the sequence is not in that state. 

An investigator can apply the bias function method to impose a bias based on 
prior knowledge of sequence features, such as an EST alignment to the subject sequence, or 
homology to a previously characterized sequence. For example, given an EST alignment to the 
subject sequence that implies the sequence is coding on the positive strand, a bias function can be 
defined that summarizes that observation. Equation XXX is one example of such a function: 



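\[ \phi(\sigma) = \begin{cases} 0.95 & \text{if } \sigma \in \{1+, 2+, 3+\} \\ 0.05 & \text{if } \sigma \notin \{1+, 2+, 3+\} \end{cases} \tag{XXX} \]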



This bias function does not exclude the possibility that the sequence is noncoding or 
coding on the reverse complement strand, although it does effectively bias the a priori 
probability that the sequence is coding in one of the forward three reading frames. The function 
above states that the three forward coding states are 19-fold (0.95/0.05) more probable than the 
other states, which is an assertion by the investigator that he is confident that the EST alignment 
is correct in indicating that the sequence is coding on that strand. 

Given the bias function defined above, the values for $P'(S \mid \sigma)$ are determined as before for 
the unbiased case. To calculate $P'(\sigma \mid S)$, however, equation XXXI is used: 

\[ P'(\sigma \mid S) = \frac{\phi(\sigma) \cdot P(\sigma) \cdot P(S \mid \sigma)}{\sum_i \left[ \phi(i) \cdot P(i) \cdot P(S \mid i) \right]} \tag{XXXI} \]



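The equations to determine $P'(\sigma \mid S)$ for each state are shown in equations XXXII through 
XXXVIII: 

\[ P'(1+ \mid S) = \frac{0.95 \cdot \frac{1}{12} \cdot P(S \mid 1+)}{\sum_i \left[ \phi(i) \cdot P(i) \cdot P(S \mid i) \right]} = \frac{0.95 \cdot \frac{1}{12} \cdot P(S \mid 1+)}{4.4399294 \times 10^{-7}} = 0.65045095 \tag{XXXII} \]

\[ P'(2+ \mid S) = \frac{0.95 \cdot \frac{1}{12} \cdot P(S \mid 2+)}{4.4399294 \times 10^{-7}} = 0.010187213 \tag{XXXIII} \]

\[ P'(3+ \mid S) = \frac{0.95 \cdot \frac{1}{12} \cdot P(S \mid 3+)}{4.4399294 \times 10^{-7}} = 0.2652289 \tag{XXXIV} \]

\[ P'(1- \mid S) = \frac{0.05 \cdot \frac{1}{12} \cdot P(S \mid 1-)}{4.4399294 \times 10^{-7}} = 0.05380379 \tag{XXXV} \]

\[ P'(2- \mid S) = \frac{0.05 \cdot \frac{1}{12} \cdot P(S \mid 2-)}{4.4399294 \times 10^{-7}} = 0.0023707938 \tag{XXXVI} \]

\[ P'(3- \mid S) = \frac{0.05 \cdot \frac{1}{12} \cdot P(S \mid 3-)}{4.4399294 \times 10^{-7}} = 0.00042392675 \tag{XXXVII} \]

\[ P'(N \mid S) = \frac{0.05 \cdot P(N) \cdot P(S \mid N)}{4.4399294 \times 10^{-7}} = 0.0017534085 \tag{XXXVIII} \]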



Given the bias function $\phi(\sigma)$, the resulting coding potential calculation indicates a 65% 
probability that the sequence is coding in the first reading frame on the forward strand. The 
result represents the coding probability given the assumptions of the investigator stated as the 
bias function. 
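
The shift from the reverse strand to the forward strand in this example can be checked with a short Python sketch. It assumes the likelihoods of equations XV through XXI rounded to three significant figures and equal a priori probabilities for the seven states (so the prior cancels); the variable and function names are illustrative only.

```python
# Likelihoods P(S | state) for GATGACATT, rounded from equations XV through XXI.
LIKELIHOOD = {"1+": 3.65e-6, "2+": 5.71e-8, "3+": 1.49e-6,
              "1-": 5.73e-6, "2-": 2.53e-7, "3-": 2.26e-7, "N": 1.87e-6}
FORWARD = ("1+", "2+", "3+")


def posterior(likelihood, bias):
    """Bayes posterior with a bias weight per state; equal priors are assumed and cancel."""
    weighted = {s: bias[s] * p for s, p in likelihood.items()}
    total = sum(weighted.values())
    return {s: w / total for s, w in weighted.items()}


no_bias = {s: 1.0 for s in LIKELIHOOD}
# Bias of 0.95 for the three forward coding states and 0.05 for every other state.
est_bias = {s: (0.95 if s in FORWARD else 0.05) for s in LIKELIHOOD}

unbiased = posterior(LIKELIHOOD, no_bias)
biased = posterior(LIKELIHOOD, est_bias)
best_unbiased = max(unbiased, key=unbiased.get)
best_biased = max(biased, key=biased.get)
```

Under these assumptions the unbiased maximum is the first reverse frame at roughly 43%, and the biased maximum is the first forward frame at roughly 65%, in line with the figures quoted in the example.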



EXAMPLE 3 

The following is a copy of the output of a program implementing the method 
described above with and without a bias function. The following sequence is a genomic sample 
from the organism Arabidopsis thaliana, landsberg. 



aatcaaaacgtggtatc^^SctSgtIcc^^^^ 

TTTGGCATCACACTTTCTAcS™?^^^^^^^^ 
—ACC^AAC.^^^^^^^^^^^^^ 
3aattcatgaaact 
ptagtggtggcacc 
tgaagatcaaagtg 
:tttccagcaggta 

:GTGTTCCAAATTT( 

:'tctgatcaaaagt( 

'AATTACTCACAAC 
^GCAACTTAAAAAAi 

lTttcattaaaaca': 

'GCTTATGTCTTATC 

.tttcaatcttaaa; 
atctcaacaagaac 

^ ^ ^^^^ ^v^^GTAAATGTTTACA^ 



GTTC 
TCCA 
CTTA 
TGTGi 
^VGTAi 
TATT' 
aCAG^ 

:ata: 
tagt; 

TTTTI 



TGGCTCCi 

ZAGATGHQ 
^CCAAGT/ 
rCTCTATl 

:agagcag 

TTATCAM 
PTGTTAGT 



TTA 
^\GA 
:CA 
2TC 
\GAi 
PATi 



atttgcaga1 
aagtcacca; 

CAACTTCTCa 
AAAAACAGAG 



\ccaagtaa; 

rCTCTATTTG 



tttgtgttcttttattc 



cctcatacacgctcgcaatncgtttggaattatcagctntaatttttctaattctttggaaattattagcagctcgat" 

NO. 1) -'"^CTCATCTAAACTTTCCATGAAGAAACAAAGCT (SEQ. ID. 



CAAATGGGGCATGGCTTCTTCTTCTATCTGCAAC 



The sequence below is the same Arabidopsis sequence after coding probabilities have 
been determined without a bias, the coding strand has been determined, and each nucleotide has 
been classified in its most probable state of the four on the coding strand (dashes represent the 
state of noncoding). 



1; 



61: 111111111111313333333333333333133333333333333333333333333333 

121: 3333233333333333333333333333133333333333333333333L3L333333 

181: 3333333333333333333333333333333333133333133333333333331^^^^ 

301; ?''^^?^^^i?L^^^?!^^^L^^^^^^^^^^^^33333333333333333333^ 
361: 



333333333333333133333333333333333313333333333333333333333333 

421. 3lL^^L-3\^^^f1^^^^f ^33333333-333333333333333333^33313^^ 

4^1. -:}JJ333333333--3 — 3 — 333333333-33 

481: 



541 
601; 
661: 
721: 
781: 
841: 
901: 
961: 
1021: 
1081: 
1141: 
1201: 



1261. 37372^11^^^^^^^^ 

1261. 3333333333333333-33-3-3-3 — -33-33333333-333 

333 — 3 

1381: 






1441 
1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981 

2041: 

2101 
2161 



3 — 33-3 333-3 3 

3-3133-33-33-3 13-22222-222222-2222222222222-2 2 

—22 2222-1222222222222222221222222222222222222222222 

22222 
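
The per-nucleotide classification shown above (most probable of the four states on the coding strand, with a dash for noncoding) can be sketched in Python as follows. The input format, a list of per-nucleotide probability dictionaries, and the function name are assumptions made for illustration; how those probabilities are obtained is outside the scope of this sketch.

```python
def classify_nucleotides(per_base_probs, strand):
    """Pick the most probable of the four states on the chosen strand for each nucleotide.

    per_base_probs: list of dicts, one per nucleotide, mapping state names such as
                    "1+", "2+", "3+", "N+" to probabilities (hypothetical format)
    strand:         "+" or "-"
    Returns a string with '1', '2', '3' for coding states and '-' for noncoding.
    """
    states = [f"{label}{strand}" for label in ("1", "2", "3", "N")]
    out = []
    for probs in per_base_probs:
        best = max(states, key=lambda s: probs[s])
        out.append("-" if best.startswith("N") else best[0])
    return "".join(out)


# Three nucleotides with made-up state probabilities on the direct strand.
probs = [
    {"1+": 0.6, "2+": 0.1, "3+": 0.1, "N+": 0.2},
    {"1+": 0.2, "2+": 0.1, "3+": 0.5, "N+": 0.2},
    {"1+": 0.1, "2+": 0.1, "3+": 0.1, "N+": 0.7},
]
labels = classify_nucleotides(probs, "+")
```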



vex* 


AAAj. 




1 




61 


|0 


121 




181 


ry 


241 




301 




361 


25 


421 




481 




541 


ly 


601 




661 : 




721 : 


Q 


781 : 




841: 




901 : 




961: 


35 


1021: 




1081: 




1141: 




1201: 




1261: 


40 


1321: 




1381: 




1441: 




1501: 




1561: 


45 


1621-: 




1681: 




1741: 




1801: 




1861: 


50 


1921: 




1981: 




2041: 




2101: 



The classifications are now filtered. First, simple gaps are filled (XYX are reclassified as XXX): 



111111111111313333333333333333133333333333333333333333333333 

333323333333333333333333333313333333333333333333333333333333 

333333333333333333333333333333333313333313333333333333133333 

333333333133133333333133333333333333333333333333333333333133 

333333333333333133333333333333333313333333333333333333333333 

333333333333333333333333333333333333333333333333333333333333 
333333333333-3-3 333333333333 

-lllllllllllllllllllllllllllllllllimm^L^^^^^^^j^JJ^^^JJJJ^ 
llllllllllllllllllllllllllllllllllimm^Ll^^^^^-^^^^^^^^^^^^ 
lllllllllllllllllllllllliiiiiiiiiiiiiiiimm^;^^^^^^^^^^^^^ 

lllllllllllllllllllllllllllllllllllimm;^-^^^^^^^^^^^^^^^^^ 

iiiiiiiiiiiniiiiiiiiiiiiiiiiiiiiiiiiiiiiimm;^^^^^^^^^^^^^ 

lllllllllllllllllllllllllllllllllimm;^^^^^^^^^^^^^^^^^^^^^ 
lllllllllllllllllllllllllllllllllllimm^^^^-^^^^^^^^^^^^^^ 
llllllllllllllllllllllllllllllllllllllmm^^^3_^^^^^^^^^^^^ 

iiiiiiiiiiiiiiiiiiniiiiiiiiiiiiiimim^^^^;^^^^^^^^^^^^^^^ 

1111111111111131111111111111131 

2222222222222222222222222222-3333333333333333333333333 

3333333333333333-3333-3-3 333333333333333 

333—3 



3—3333 33333 3 

33313333333333 13-2222222222222222222222222222 2 

—22 2222-1222222222222222221222222222222222222222222 
2161: 22222 



Next, XXYXX gaps are reclassified as XXXXX: 



61: 111111111111313333333333333333333333333333333333333333333333 
121: 333333333333333333333333333333333333333333333333333333333333 
181: 333333333333333333333333333333333333333333333333333333333333 
241: 333333333333333333333333333333333333333333333333333333333333 
301: 333333333333333333333333333333333333333333333333333333333333 
361: 333333333333333333333333333333333333333333333333333333333333 

421: 333333333333 333333333333 

481: 11 — 1111- 

541: -llllllllllllllllllllllllllllllliiiiiiiiiiiiiiiiiilimmj^l 
601: lllllllllllllllllllllllllllllllliiiiiiiiiiiiiiiiiiimi^Lljj^l 
661: lllllllllllllllllllllllllllllllliiiiniiiiiiiiiiiiimj^3^;^j3_;^ 
721: lllllllllllllllllllllllllllllllllllliiiiiiiiiiiiiiiiiimm 
781: llllllllllllllllllllllllllllllliiiiiiiiiiiiiiiiiiiiiiimm 
841: lllllllllllllllllllllllllllllllliiiiniiiiiiiiiiiiiiiimm 
901: lllllllllllllllllllllllllllllllllllliiiiiiniiiiiiiiiiiiim 
961: lllllllllllllllllllllllllllllllliiiiiiiiiiiiiiiiiiiiimim 
1021: llllllllllllllllllllllllllllllllliiiiiiiiiiiiiiiiiimmm 

1081: 1111111111111111111111111111131 

1141: 

1201: 2222222222222222222222222222-3333333333333333333333333 

1261: 3333333333333333—3333 333333333333333 

1321: 333 

1381: 

1441: 

1501: 

1561: 

1621: 

1681: 

1741: 

1801: 

1861: 

1921: 

1981: 3333 33333 3 

2041: 33333333333333 13-2222222222222222222222222222 

2101: —22 2222-1222222222222222222222222222222222222222222 

2161: 22222 



Next, XXYYXX gaps are reclassified as XXXXXX: 



111111111111313333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333 333333333333 

11111 






541 
601 
661 
721 
781 
841 
901 
961 
1021 
1081 
1141 
1201: 
1261 
1321 
1381 
1441 
1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981 
2041 
2101 
2161 



iiiiiiiiiiiiiiiiiiiiii 



""J?!,'!?;;,"?.""""""i"iiiii"uuuiiiniiniuiini 



11111111111111111111111^^^^3^ 



^^^^ 33333 

33333333333333----l3-222222222222222222222222"2"2l2^^^^^ 

'''''^^22222222222222222222222222222222222222'2"2"2" 



22222 



f3 



Next, XYYX gaps are reclassified as XXXX: 






1 
61 
121 
181 
241 
301 
361 
421 
481 
541 
601 
661: 
721 
781 
841 
901 
961 
1021 
1081 
1141 
1201 
1261 



iiliiiiii 



333333333333 



-11111 



mmmm 



1111111111111111111111111111131- 



333333S^Ll3^^1L1^^Llllllll-!fi™^^^^^ 
1321 
1381 
1441 
1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981 
2041 
2101 
2161 



-333- 



3333 33333 3 

33333333333333 13-2222222222222222222222222222 

222222222222222222222222222222222222222222222222 

22222 



Next, XYX gaps are reclassified as XXX: 

1: 



61: 111111111111113333333333333333333333333333333333333333333333 

121: 333333333333333333333333333333333333333333333333333333333333 

181: 333333333333333333333333333333333333333333333333333333333333 

241: 333333333333333333333333333333333333333333333333333333333333 

301: 333333333333333333333333333333333333333333333333333333333333 

361: 333333333333333333333333333333333333333333333333333333333333 

421: 333333333333 333333333333 

a 481: ^^^^^ 

ly 541: llllllllllllllllllllllllllllllllllliiiniiiiiiiiiiiiiiiii]^!! 

601: lllllllllllllllllllllUlllllllllliiiiiiiiiiiiiiiiiimm^ll 

661: llllllllllllllllllllllllllllllllliiiiiiiiiiiiiiiiiiiii]^!]^;^!! 

721: 111111111111111111111111111111111111111111111111111111111111 

781: 111111111111111111111111111111111111111111111111111111111111 

841: 111111111111111111111111111111111111111111111111111111111111 

901: llllllllllllllllllllliiiiiiiiiiiiiiiiiiiiiiiiiiiii^i;^^^^^^^^^ 

961: 111111111111111111111111111111111111111111111111111111111111 

1021: 111111111111111111111111111111111111111111111111111111111111 

1081: 1111111111111111111111111111111 

1141: 

1201: 2222222222222222222222222222-3333333333333333333333333 

1261: 3333333333333333333333 333333333333333 

1321: 333 

1381: 

1441: 

1501: 

1561: 

1621: 

1681: 

1741: 

1801: 

1861: 

1921: 

1981: 3333 33333 3 

2041: 33333333333333 13-2222222222222222222222222222 
2101: 222222222222222222222222222222222 



2161: 22222 —" — ^^"^^^^^'^^'^^^^22222222222222222222222 
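
The gap-filling passes applied above (XYX, XXYXX, XXYYXX, XYYX, then XYX again) can be sketched in Python. This is a minimal sketch under stated assumptions: X is taken to be any class symbol including the noncoding dash, Y is any symbol different from X, and each pass scans left to right and updates the string in place. None of these details is confirmed by the listing itself.

```python
def fill_gaps(classes, flank, gap):
    """Reclassify runs of `gap` symbols flanked on both sides by `flank` copies of one class.

    For example, flank=1 and gap=1 implements XYX -> XXX,
    and flank=2 and gap=2 implements XXYYXX -> XXXXXX.
    """
    out = list(classes)
    i = flank
    while i + gap + flank <= len(out):
        x = out[i - 1]
        left_ok = all(c == x for c in out[i - flank:i])
        right_ok = all(c == x for c in out[i + gap:i + gap + flank])
        gap_differs = all(c != x for c in out[i:i + gap])
        if left_ok and right_ok and gap_differs:
            out[i:i + gap] = [x] * gap
        i += 1
    return "".join(out)


# (flank, gap) pairs in the order the passes appear in the example.
PASSES = [(1, 1), (2, 1), (2, 2), (1, 2), (1, 1)]


def run_filters(classes):
    """Apply all gap-filling passes in sequence."""
    for flank, gap in PASSES:
        classes = fill_gaps(classes, flank, gap)
    return classes


smoothed = run_filters("33313331133")
```

In this toy input the single stray `1` is removed by the first pass and the `11` pair by the XXYYXX pass, leaving a uniform run of class 3.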



ii iiiiiiiiiiiS 

Next, the sequence is checked for frameshifts and reclassified accordingly: 

iMiiiii 
421: 333333333333333333333333333333333 

481: 

iiininiiiiiiiiiiniiiiniiiiiiiiniimnnimnnmnn 
inininiiiiiiiiiinnniiiiiiiiinniniiiiiiiiiiim^^^^ 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiinniiiiiiniiiiiiiiui 
iinniiiiiinniniiniiiiiininniiiiiiinin 1 1 
11 1 iiiiiiiiniiiiiiniiiiiiinniiiiiiiniiinniiii 1111 
iiiiiiiiiiiiiiiiiiiiiiiniiiiiiiiiiiiniiiiiiiiiiiiiiniiiii 
iiiiiiiiiiiiiiiiniiiiniiiiiiiiiiiiniiiiiiiiiiiiinJm^^^^ 

^'J^\^J^^^i^^^iiiiiiiiiiiiiiiiiniiiiiiiiiiiiiiiniiiiii ill 

lllllllllllllllllllllllllllim 



541 
601 
661 
721 
781 
841 
901 
961 
1021: 
1081 
1141 
1201 
1261 
1321 
1381 
1441 
1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981: 
2041 
2101 
2161 



------222222222222222222222222222222222222222233333333333^^^ 



^^33^33333333333333333333333333333333333333333 

3333333333333333333333333333333332222222222222222222^22^^^^^^ 
2222222222222222222222222222222 



Finally, the sequence is translated according to each class in each coding region, where an 
"x" indicates a stop codon: 

61: TLSTMSFVLPLKNIRXLTEAPLNPKANREKMTQIMFETFNTPAMYVAIQAVLSLYASGRT 
121: TGQYITTFFLYRXSGDGVSHTVPIYEGYALPHAILRLDLAGRDLTDHLMK 
181: TTAEREIVRDMKEKLSYIALDFEQELETSKTSSSVEKSFELPDGQVITIGAERFR 
241: FQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFDGIGDRMSKEITALA 
301: PSSMKIKVVAPPERKYSVWIGGSIXVPNLQMWIAKAEYXNLDRQSSTGSASDQKSPSKTR 
(SEQ. ID. NO. 2) 

The following sequence is the same Arabidopsis sequence used above, but with 
applied bias. Two bias functions are given by equations XXXIX and XL: 

\[ \phi(\sigma) = \begin{cases} \ldots \\ 0.05 & \text{if } \sigma = N \end{cases} \tag{XXXIX} \]

\[ \phi(\sigma) = \begin{cases} \ldots \\ 0.95 & \text{if } \sigma = N \end{cases} \tag{XL} \]

The bias function of equation XXXIX is applied to a range of the DNA to which an EST has been 
associated, while the bias function of equation XL is applied to a range of the DNA to which a gap 
(or intron) in the EST has been associated. Specifically, the function of equation XXXIX is applied to 
nucleotides 1093 through 1137 and 1219 through 1291, while the function of equation XL is 
applied to nucleotides 1138 through 1218. The probabilities are calculated with the bias, the 
coding strand is determined, and each nucleotide is classified as the most likely state. The 
resulting sequence is depicted below. 



1 

61 
121 
181 



111111111111313333333333333333133333333333333333333333333333 
333323333333333333333333333313333333333333333333333333333333 
o.. ^Jf ^^^^^233333333333333333333333313333313333333333333133333 
241: 333333333133133333333133333333333333333333333333333333333133 
301: 333333333333333133333333333333333313333333333333^3^3333^^^^^ 

• Ll3'^^^^:^f^f-^^^^^22233323233-33333333333333333333333333 

421 : 333333333333—3—3—333333333-33 

481: 

II]'' ! } } J\^i^iii"iiiii"iiiiiiiiiiiiiiiiiiiiiiniiiii-iiii 

: J ! J J -J^-iiiiiiiiiiiiii-iiiiiiiiiiiiiiiiiinniiiiiiii 

iiniiiiiiiii-inii-iiiniii-iiiinniiiiiiiiinnniiiiiiii 

iiiiiiiiiiiiiiiiiiiiiiiiiininiiiiiiiiinniiniiiiiiiiiiii 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiniiiiiiiniiiiiiiiiiii_^^^^ 

iiiiiiiiiiiiiiiii-iiiiiiiinniiiiiiiiiiniiii-iiiiiiiniin 
11111111111111111111111111111111111111111111111111111^1^^^^^ 
11111111111111111111111111111111111111111111111^^ 
1081: 11111111111111311111111111111311111111-1 _:_ 

1141: 

1321: —333 — 3 

1381: 

1441: 

1501: VSS.Z 

1561: 21 

1621: 

1681: 



721 
781: 
841: 
901: 
961: 
1021: 
1741 
1801 
1861 
1921 
1981 
2041 
2101 
2161 



3—33-3 333-3 3 

3-3133-33-33-3 13-22222-222222-2222222222222-2 2 

2222-1222222222222222221222222222222222222222222 



-22 

22222 



10 



Filtering steps are then applied as before: XYX to XXX: 





1 




61 




121 


15 

m 


181 
241 
301 
361 
421 


r r^E 

ru 


481 
541 
601: 


Id- 


661: 




721: 




781: 


la 


841: 


J— 
f^i 


901: 
961: 


iu 

ilea 


1021: 
1081; 




1141: 




1201: 
1261: 
1321: 


35 


1381: 
1441: 
1501: 
1561: 
1621: 


40 


1681: 
1741: 
1801: 
1861: 
1921: 


45 


1981: 
2041: 
2101: 
2161: 



111111111111313333333333333333133333333333333333333333333333 

333323333333333333333333333313333333333333333333333333333333 

333333333333333333333333333333333313333313333333333333133333 

^^^^^S?J^^;?^?o''''"'''''''''''^^^3^333333333333333333133 

333333333333333133333333333333333313333333333333333333333333 

'''''''''''''''''"^^333333333333333333333333 
333333333333—3 — 3—333333333333 

-iiiiiiiiiiiiiniiiiiiiiiiiiiininniiiiiiiiiinnniinni 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiniiiiiiiii;^^^^^^^^^^^^^^^^ 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiinnimiiiiiiT 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiniiiiiiiiiiinnni 

iiiiiiiiiiiiiiiiiiiiiiiiiiniiiiiiiiiiniiiiiiiiiiiiiiiniii 
11111111111111111111111111111111111^^^^^^^^^^^^^^^^^^^^^^ 

iiiiiiiiiiiiiiiiiiiiiiiiniiiiiiiiiiiiiiiiiiiii^^^^^^^^^^^^^ 

11111111111111111111111111111111111111111111^^^^^^^^^^^ 

1111111111111111111111111111111111111111111111 

1111111111111131111111111111131111111111 _ _ 



, ~ 221221222122222213333333333333333333333333 

33333333333333333333333333333333333333333333333 --- 

333—3 



3 — 3333 33333 3 

33313333333333 13-2222222222222222222222222222 2 

22222 2222-1222222222222222221222222222222222222222222 
XXYXX to XXXXX: 



1: 



61: 111111111111313333333333333333333333333333333333333333333333 
181 
241 
301 
361 
421 
481 
541 
601 
661 
721 
781 
841 
901 
961 
1021 
1081: 
1141 
1201 
1261 
1321 
1381 
1441 
1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981 
2041 
2101 
2161 



121: 333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333^333^^^^^^^ 
3333333333333333333333333333333333333333333333^33^^^^^^ 
^^^^^^?f?^^^^^^^^^3333333333333333333333333 333^^^^^^^^ 



333333333333- 



-333333333333- 



333333333333333333333333333333333333333333333333333333333333 

■11— -1111- 
111 



-iiiiiiiiiiiiiiiinnniiiiiiiinniiiiiiiinnniiiiiin 



iiiiiiiiiiiniiiiiiiiiiniiiiiiininniiiiiinniiininin 
niiiiniiiiiiiiiniiniiiinniiiiiniiii^^ 

iiiiniiiiiiiinininniiiiiiiinnnniiiiiiiiiiniinii:: 
iiniiiiiiinniiiiiiinniiiiiiiinniiiii mm^^ 



1111 
1111 



iiiiiiiiiiiiniiiiiiiininiiiiiiiiiniiiiiiiiininiiiiiin 
iiiiiiiiiiniiiiiiiiiiiiiininiiiiiiin-----:--::L_^!^^^^ 



— -333- 



3333 33333 

33333333333333 13-2222222222222222222222222222 — - — 

22222 ^^22-^222222222222222222222222222222222222222222 



XXYYXX to XXXXXX: 



61 
121 
181 
241 
301 
361 
421 
481 
541 
601 
661 
721 
781 
841 



333333333333333333333333333333333333333333333333333353^33^? 

iiniiiiiiiiiiniiiiiiniiniiiiiiniiiiiiii nn^^^^i^ 

iiiiiniiiiiiiiiiiiniiiiiiiiinniiiiininniiiiiiiinnii 
niiiinnniiiiinniiiiinnniiiiiiinnii 
'''' "i]]]]]^^^^^^^^^ 

11 

11111111111111111111111111111^^1^^^^^^ 



961 
1021 
1081 
1141 
1201 
1261 
1321 
1381 
1441 
1501 
1561 
1621 
1681 
1741 
1801: 
1861 
1921 
1981 
2041 
2101 
2161 



1111111111111111111111111111111111111,^^ 1^ 

iiiiiiniiiiiiiiiiiiiiiiiiiiiiiiiiii^ 



3333 3333^ 

33333333333333- 13-2222222222222222222222222222 

22222 ^^22^2222222222222222222222222222222222222222222 



XYYX to XXXX: 



61 
121 
181 
241 
301 
361 
421 
481 
541 
601 
661 
721 
781 
841 
901 
961 
1021 
1081 
1141 
1201, 
1261 
1321 
1381 
1441 
1501 
1561 
1621 



111111111111313333333333333333333333333333333333333333333^3^ 
3333333333333333333333333333333333333333333333^3^^^^ 

333333333333333333333333333333333333333333333^^^^^^^^ 
33333333333333333333333333333333333333333333333^^^^^^^^ 
3333333333333333333333333333333333333333333333333^^^^^^^^ 



111 



iiiiiiiiiiiiiiiiinnniiiiiiiiiiiiiiiiiiiiniiiiiiinin 

iiiiiniiiiniinnnniiiinnniiiiiinniiiiim^^^^ 
1111 iniiniiiiiiiniinininniiiiiiinnniiim^^^^ 
11 iiiiiiniiiiiiiniiiiiiiiiinniiiniinnniiimm^^ 



iiiiiiiiiiiiiiiiiiiiiiinniiiiiiiniiiiiiiniiniiiiiiinii 
11 iiiiiiiiiiiiiiiiiinniiiiniiiiiiiiinin iiiin 
niiiinininiiiiiiiinniiiiiiinniiiiiiiiinnm^^^^ 

iiiiiiiiiiiniiiiinnniniiiinnniiiiii nn^ n^^^^^^^^ 
iiiiiiiiiiiiiiiiiiiiiiiiiniiinniiiiii ^^^-^^1111111111 
1681 
1741 
1801 
1861 
1921 
1981 
2041 
2101 
2161 



3333 33333 3 

33333333333333 13-2222222222222222222222222222 

222222222222222222222222222222222222222222222222 



22222 



XYX to XXX: 
61 
121 
181 
241 
301 
361 
421 
481 
541 
601 
661 
721 
781 
841 
901 
961 
1021 
1081 
1141 
1201 
1261 
1321 
1381 
1441: 
1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981; 
2041 
2101 
2161 



111111111111113333333333333333333333333333333333333333333333 
3333333333333333333333333333333333333333333333333^333^^^^^^ 

33333333333333333333333333333333333333333333333333^^^^^^^^ 
333333333333333333333333333333333333333333333333333333^^^^^^ 
3333333333333333333333333333333333333333333333333333^3^^^ 

333333333333 333333333333 



11111111111111111111111111111] 



11111 

T . 1111111111111111111111111111111 

iiiiiiiiiiiiiiiiiiiiniiiiiiiiiiiiniiiiiiiinniniiiiim^^ 

iiiiiiiiiiiiiiiniiniiiiiiiiiiiiiniiiiiiiiiini 
iiiiiiiiiiiiiiiiiiiiniinniiiiiiiiiiiiiiniiiiiiii ; 

iiiniiiiiiiiiiiiniiiiiiiiiiniiiiiiiiiinniiiiiiniiiiin 

iiiiiiiiiiiiiiiiiniiniiiiiiiiiiiininniiiiiiiiii inii 1 
iiiiiiiiiiiiiiiiiiiiinniiiiiinnninnniiiii,,\\\\\^^^^^^^^^ 

iiiiiiiiiiiiiiiiiniiiiiiiiiiiniiiiiiiiiiinii 11 1 m 



iiiiiiiiiiiiiiiiiiiiiiiiiiiiimm^^^^^; 



.11 



3333 33332 ^ 

33333333333333 13-2222222222222222222222222222 

222222222222222222222222222222222222222222222222 



22222 



Gaps between coding regions that are not introns are filled as before: 



1: 



61 



111111111111113333333333333333333333333333333333333333333333 

333333333333333333333333333333333333333333333333333333333333 

333333333333333333333333333333333333333333333333333333333333 

333333333333333333333333333333333333333333333333333333333333 

333333333333333333333333333333333333333333333333333333333333 

333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333 

iiiiiiiiiiiiiiniiiiiiiiiiiiiiiiiimm;^i^^^^^^^^^^^^^JJJJJ 

llllllllllllllllllllllllliiiiiiiiiiiiiiiiiiiillll^^^^jj^^^^^ 
llllllllllllllllllllllllllllllllllllllimi-L^^-^^-^^^^^^^^^^^^ 

111111111111111111111111111111111111111111111111111111111111 
lllllllllllllllllllllllllllllUlllllim^l^^^jj^^^^^^^^^^^^^ 

iiiiiiiiiiiiiiimiiiiiiiiiiiiiiiiiiiimii^^^^^^^^^^^^^^^^^ 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimm^^j^^^^^^^^^^^^^^ 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiim-^ii^^^^^^^^^^^^^^^^^^^ 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiij^^^^^^^^^^^^^^^ 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimj^ 

:::: 222222222222222213333333333333333333333333 

333333333333333333333333333333333333333333333333333333333333 

^ 3 *3 3 3 3 "~ ~~ ~ ~~ *~ — — ~~ ~— — — — — — _ 



61 
121 
181 
241 
301 
361 
421 
481 
541 
601 
661 
721 
781 



3333333333333333333333333333333333333333333333 
333333333333333311132222222222222222222222222222222222222222 
222222222222222222222222222222222222222222222222222222222222 

Frameshifts are verified and nucleotides are reclassified accordingly: 

llllllllllllllllllllllllllllllllllllllllllllm;L^^^^^^~~~~j 
llllllllllllllllllllllllllllllllllmm^^^^^j^^^^^^^^^^^^^^ 
llllllllllllllllllllllllllllllmi;^^^;^^^^^^^^^^^^^^^^^^^^^^^ 

111111111111111111111111111111111111111111133333333333333333 

333333333333333333333333333333333333333333333333333333333333 

333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333 

llllllllllllllllllllllllllllllllllllim^^^^^^^^^^^^^^^JJJJJ 

lllllllllllllllllllllllllllllllllllim^li;^^^^^^^^^^^^^^^^^^ 
lllllllllllllllllllllllllllllllllllllimm^^^^^j^^^^^^^^^^ 
1111111111111111111111111111111111111^^^^^^3^^^^^^^^^^^^^^^ 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiuiiiiiimimmm-^i^^^^^^^^^ 






841 
901 
961 
1021 
1081 
1141 
1201 
1261 
1321 
1381 
1441 
1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981 
2041 
2101 
2161 



111111111111111111111111111111111111111111111111111111111111 
111111111 11 111111111111 11111111111111 11 1111111 111 11 111^-^^^^^ 

111111111111111111111111111111111111111111111111111113^^^-^^^^ 

11111111111111111111111111111111111111111111111111111-^ -Llllll 
1111111111111111111111111111111111111111 

222222222222222222222222222233333333333333 

333333333333333333333333333333333333333333333333333333333333 
333333 



3333333333333333333333333333333333333333333333 

333333333333333333333333333333333222222222222222222222222222 

222222222222222222222222222222222222222222222222222222222222 
22222 



61 
121 
181 
241 
301 
361 
421 
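
The reclassification step above can be illustrated by locating the points at which the per-nucleotide state changes. The following is a minimal sketch only, assuming that each digit in the listings denotes the state assigned to one nucleotide and that a change of digit marks a candidate frameshift or feature boundary. The function names `state_runs` and `transitions` and the toy input string are hypothetical and are not taken from the specification.

```python
from itertools import groupby


def state_runs(states):
    """Collapse a per-nucleotide state string into (state, start, end) runs.

    Positions are 1-based and inclusive. Assumption: one character per nucleotide.
    """
    runs = []
    pos = 1
    for state, group in groupby(states):
        length = sum(1 for _ in group)
        runs.append((state, pos, pos + length - 1))
        pos += length
    return runs


def transitions(states):
    """Return (position, previous_state, new_state) for every state change.

    The position is the 1-based index of the first nucleotide in the new state.
    """
    return [
        (i + 1, states[i - 1], states[i])
        for i in range(1, len(states))
        if states[i] != states[i - 1]
    ]


if __name__ == "__main__":
    toy = "1113222"  # hypothetical state string, not data from the listings
    print(state_runs(toy))
    print(transitions(toy))
```

Each reported transition could then be checked against the sequence before the affected nucleotides are reclassified; the criteria for that check are not part of this sketch.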



And the sequence is translated as before: 

XRFFRALxAVLATPVxWLGWDKRMLMLETRLNQNVVSxLxSTQLSMELLIIGMTWRRFGI 

TLSTMSFVLPLKNIRXLTEAPLNPKANREKMTQIMFETFNTPAMYVAIQAVLSLYASGRT 

TGQYITTFFLYRXSGDGVSHTVPIYEGYALPHAILRLDLAGRDLTDHLMKILTERGYSFT 

TTAEREIVRDMKEKLSYIALDFEQELETSKTSSSVEKSFELPDGQVITIGAERFRCPEVL 

FQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFDGIGDRMSKEITALA 

PSSMKIKVVAPPERKYSVWIGGSILASXQMWIAKAEYXNLDRQSSTGSASDQKSPSKTRA 

VKILXNSSAVNFSTSYTLAIRLELSALIFLISLEIISSSIKWGMASSSICNSSKLSMKKQ
SX (SEQ. ID. NO. 3)




The resulting amino acid sequence (SEQ. ID. NO. 3) differs from the amino acid
sequence calculated without a bias (SEQ. ID. NO. 2). The relative accuracy of the two amino
acid sequences can be determined by comparison to a known sequence. SEQ. ID. NOs. 2 and 3
are compared to the translation of the actin gene from Arabidopsis thaliana,
Columbia ecotype (SEQ. ID. NO. 4). Dashes indicate gaps in the sequence and asterisks indicate a match
among all three sequences. The predicted amino acid sequences (SEQ. ID. NOs. 2 and 3) are
based on Arabidopsis thaliana, Landsberg ecotype. A comparison of the predicted sequences with a
known Arabidopsis thaliana, Columbia ecotype amino acid sequence (SEQ. ID. NO. 4) is shown
below. The sequence set forth in Box A illustrates an area of the biased sequence that shows a
higher level of identity with the Arabidopsis thaliana, Columbia sequence.

unbiased
biased
Columbia



-XRFFRALX-AVLATPVXWLGWDKRMLMLETRLNQNWSX— LXSTQLSMELLIIG— M 
-XRFFRALX-AVLATPVXWLGWDKRMLMLETRLNQNVVSX— LXSTQLSMELLIIG— M 
GDDAPRAVFPSIVGRPR-HTGVMVGMGQKDAYVGDEAQSKRGILTLKYPIEHGIVNNWDD 



unbiased 
biased 
Columbia



TWRRFGITLSTMSFVLPLKNIRXLTEAPLNPKANREKMTQIMFETFNTPAMYVAIQAVLS 
TWRRFGITLSTMSFVLPLKNIRXLTEAPLNPKANREKMTQIMFETFNTPAMYVAIQAVLS 
MEKIWHHTFYNELRVAPEEHPVLLTEAPLNPKANREKMTQIMFETFNTPAMYVAIQAVLS 



unbiased
biased
Columbia

LYASGRTTGQYITTFFLYRXSGDGVSHTVPIYEGYALPHAILRLDLAGRDLTDHLMKILT
LYASGRTTGQYITTFFLYRXSGDGVSHTVPIYEGYALPHAILRLDLAGRDLTDHLMKILT
L-ASGRTTGG------IVLDSGDGVSHTVPIYEGYALPHAILRLDLAGRDLTDHLMKILT

* * * * * * *

unbiased
biased
Columbia

ERGYSFTTTAEREIVRDMKEKLSYIALDFEQELETSKTSSSVEKSFELPDGQVITIGAER
ERGYSFTTTAEREIVRDMKEKLSYIALDFEQELETSKTSSSVEKSFELPDGQVITIGAER
ERGYSFTTTAEREIVRDMKEKLSYIALDFEQELETSKTSSSVEKSFELPDGQVITIGAER

unbiased
biased
Columbia

FRCPEVLFQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFDGIGDRMS
FRCPEVLFQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFDGIGDRMS
FRCPEVLFQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFGGIGDRMS

unbiased
biased
Columbia

Box A

KEITALAPSSMKIKVVAPPERKYSVWIGGSIX-
KEITALAPSSMKIKVVAPPERKYSVWIGGSILAS
KEITALAPSSMKIKVVAPPERKYSVWIGGSILASLSTFQQMQMWIAKAEY DESG

VPNLQMWIAKAEYXNLDRQSSTG
XQMWIAKAEYXNLDRQSSTG

unbiased
biased
Columbia

SASDQKSPSKTRAVKILXNSSAVNFSTSYTLAIRLELSALIFLISLEIISSSIKWGMASS
SASDQKSPSKTRAVKILXNSSAVNFSTSYTLAIRLELSALIFLISLEIISSSIKWGMASS
PS IVHRKCF



unbiased 
biased 
Columbia



SICNSSKLSMKKQSX 
SICNSSKLSMKKQSX 
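
The comparison above can be approximated programmatically. The following sketch, offered under stated assumptions rather than as the method of the invention, marks with an asterisk each column in which all aligned rows carry the same residue, and computes a simple pairwise identity over columns in which neither row has a gap. The function names `match_line` and `identity` are hypothetical. The two example rows are the biased and Columbia rows of one alignment block shown above.

```python
def match_line(rows, gap="-"):
    """Return a string with '*' where every row has the same non-gap residue.

    All rows must already be aligned to the same width.
    """
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("aligned rows must have equal length")
    marks = []
    for column in zip(*rows):
        first = column[0]
        same = first != gap and all(residue == first for residue in column)
        marks.append("*" if same else " ")
    return "".join(marks)


def identity(row_a, row_b, gap="-"):
    """Fraction of identical residues among columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


# Rows copied from one alignment block of the comparison above.
biased = "FRCPEVLFQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFDGIGDRMS"
columbia = "FRCPEVLFQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFGGIGDRMS"

if __name__ == "__main__":
    print(biased)
    print(columbia)
    print(match_line([biased, columbia]))
    print(f"identity: {identity(biased, columbia):.3f}")
```

Passing all three aligned rows of a block to `match_line` would reproduce the asterisk convention described above, provided the rows are first padded with dashes to a common width.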